Simple data processing with Haskell

Published in The Agile Monkeys’ Journey, May 18, 2017

When doing data science, most of the time, we’re just cleaning up our data.

[…] Cleaning up data to the point where you can work with it is a huge amount of work. If you’re trying to reconcile a lot of sources of data that you don’t control like in this flight search example, it can take 80% of your time. [Source]

This task can be tedious, but it is very important.

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.” Jeffrey Heer

The good news is that it doesn’t have to be that bad, and Haskell can help.
If you don’t have enough experience with Haskell, you may have heard that it is a somewhat esoteric language. Others claim that it is the silver bullet for all problems.

Actually, it’s neither. We don’t believe in magic, but we do think it’s an awesome language for data processing.

Let’s imagine that we want to help a small company solve a simple problem with their sales data:

Which categories of products do we have, and how many products have we sold in each? It doesn’t have to be perfect; we just want some guidance.

So we start working on their data. Let’s take a look at how the process works.

First of all, we enable the necessary language extensions (which may look scary at first, but they enable some language goodies) and import some packages, such as:

Flow: allows the use of some nice operators like |>
Text: makes text processing fast and easy
Frames: generates types for our data at compile time

{-# LANGUAGE ConstraintKinds, DataKinds, FlexibleContexts, GADTs,
             OverloadedStrings, PatternSynonyms, QuasiQuotes,
             ScopedTypeVariables, TemplateHaskell, TypeOperators,
             ViewPatterns #-}

import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import Control.Applicative
import qualified Control.Foldl as Foldl
import qualified Data.Foldable as Foldable
import Data.Char
import Data.Proxy (Proxy(..))
import Data.List
import Lens.Family
import Frames
import Frames.CSV (readTableOpt, rowGen, RowGen(..))
import Pipes hiding (Proxy)
import qualified Pipes.Prelude as Pipes
import Flow
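Since |> shows up in almost every snippet that follows, here is a minimal sketch of how it reads. To keep it runnable with only base, it defines a local stand-in that behaves like Flow’s operator (x |> f = f x); in the real program the operator comes from Flow. The helper name shout is made up for illustration.

```haskell
import Data.Char (toUpper)

-- Local stand-in for Flow's (|>): apply the function on the right
-- to the value on the left. The fixity is assumed to match Flow's.
infixl 0 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Hypothetical helper, only for illustration.
-- Read it top to bottom: take s, uppercase it, then append "!".
shout :: String -> String
shout s =
  s
    |> map toUpper
    |> (++ "!")

main :: IO ()
main = putStrLn (shout "haskell")  -- prints HASKELL!
```

Without the operator, the same function would be written inside out as (++ "!") (map toUpper s), which is why pipelines of several steps tend to read better with |>.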

Our data looks like this (after dropping some superfluous columns):

id, create_date, name_template, create_uid, qty
1434, 2016-04-01 00:33:21, Notebook with pen, Paw Patrol, 1, 1
1437, 2016-04-01 00:33:21, Small notebook, Frozen, 1, 1

Let’s begin by loading it into our program:

tableTypes "Sale" "resources/order_line_products_from_april_to_august.csv"
 
loadSales :: IO (Frame Sale)
loadSales = "resources/order_line_products_from_april_to_august.csv"
 |> readTableOpt saleParser
 |> inCoreAoS

So, what happened here? Let’s see:

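tableTypes is a Template Haskell function, which means that it is executed at compile time. It generates a data type for our CSV, so we have everything under control with our types. Along with the Sale row type, it also generates the saleParser options and the column accessors (such as nameTemplate) that we use in this article.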

loadSales :: IO (Frame Sale) is the type signature. In Haskell, we read the double colon as “has type”, so in this example we say that loadSales has type IO (Frame Sale). Let’s analyze this type in a bit more depth:

- Sale is the type that tableTypes generated for us; it represents one row.
- Frame is a wrapper type that denotes a data frame containing some rows.
- IO is another wrapper type that indicates that its contents have been loaded from somewhere outside of the program. In this case, they came from a CSV file.

loadSales reads the table from the CSV file into memory (“core”) as an Array of Structures (inCoreAoS).

In Haskell, we do our business like this, explicitly telling the compiler what we are handling in each step. In most cases, the compiler will tell us when we are doing something wrong with our types, like defining a function that handles True as a parameter but not False. Sometimes it will even tell us what we have to write.
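As a small illustration of that kind of feedback, the sketch below turns on GHC’s -Wincomplete-patterns warning. The function describeFlag is a made-up example. As written it covers both cases; if you delete the False equation, GHC warns at compile time that the pattern match is non-exhaustive and names the missing case.

```haskell
{-# OPTIONS_GHC -Wincomplete-patterns #-}

-- Hypothetical function, only for illustration.
-- Both constructors of Bool are handled, so GHC stays quiet.
-- Removing the False equation makes GHC warn that the patterns
-- are non-exhaustive and that False is not matched.
describeFlag :: Bool -> String
describeFlag True  = "enabled"
describeFlag False = "disabled"

main :: IO ()
main = mapM_ (putStrLn . describeFlag) [True, False]
```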

Frame and IO are wrapper types. Most of these wrappers have a function in common called fmap, which applies a function that you pass as a parameter to the wrapper’s contents. Sound familiar? List is a wrapper type too, and its map function applies a function that you pass as a parameter to its contents. It’s exactly the same thing. You can even use fmap on a list:

map (+1) [1, 2, 3, 4] == fmap (+1) [1, 2, 3, 4]
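The same idea carries over to other wrapper types. Here is a minimal sketch using only base, with Maybe and IO standing in for the wrappers; the values are invented for illustration.

```haskell
main :: IO ()
main = do
  -- On lists, fmap and map agree.
  print (fmap (+ 1) [1, 2, 3, 4 :: Int])     -- prints [2,3,4,5]
  -- On Maybe, fmap changes the value only if there is one.
  print (fmap (+ 1) (Just 41 :: Maybe Int))  -- prints Just 42
  print (fmap (+ 1) (Nothing :: Maybe Int))  -- prints Nothing
  -- On IO, fmap transforms the result that the action will produce.
  n <- fmap length (return "sales" :: IO String)
  print n                                    -- prints 5
```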

So in our example, we can now begin to get all the names from our data frame:

namesFromFrame :: Frame Sale -> [Text]
namesFromFrame saleFrame = 
 saleFrame
 |> fmap (view nameTemplate)
 |> Foldable.toList

Here we say that namesFromFrame takes a Frame as a parameter and returns a list of Text . We name the first parameter saleFrame .

Now we take saleFrame and extract nameTemplate from each row, resulting in a Frame Text . After that, we convert it to a [Text] .

Let’s strip the stopwords and other noise from our names, leaving only the interesting words:

cleanNameIntoWords :: [Text] -> [Text] -> Text -> [Text]
cleanNameIntoWords languageWords stopWords nameToClean =
  nameToClean
    |> Text.toLower
    |> Text.filter isAlphaOrSeparator
    |> Text.words
    |> (\\ stopWords)
    |> filter (`elem` languageWords)
    |> filter notSpuriousWord
  where
    -- Keep letters and separators (such as spaces); drop everything else.
    isAlphaOrSeparator c = isAlpha c || isSeparator c
    -- Keep only words with at least three characters.
    notSpuriousWord w = Text.length w > 2

cleanNameIntoWords accepts three arguments:

A list of texts that has all the words in a language
A list of texts that includes all the stopwords of that language
The name we want to clean

After that we apply the following process to the name:

We make it lowercase
We keep only the characters that are letters or separators (such as spaces)
We convert it into a list of words
We remove the stopwords from it. ( \\ is list difference; it removes the first occurrence of each stopword)
We leave the words that are from that language
We remove words that have fewer than three characters
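To make the steps concrete, here is a sketch that walks one product name through them, one binding per step. The stopword and language-word lists are tiny made-up samples, and the binding names are invented for illustration. The snippet avoids Flow so that it needs only base and the text package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAlpha, isSeparator)
import Data.List ((\\))
import Data.Text (Text)
import qualified Data.Text as Text

-- Made-up inputs, only for illustration.
sampleStopWords :: [Text]
sampleStopWords = ["with", "the"]

sampleLanguageWords :: [Text]
sampleLanguageWords = ["notebook", "pen", "small"]

-- Each binding is one step of the process described above.
lowered, lettersOnly :: Text
lowered     = Text.toLower "Notebook with pen, Paw Patrol"
lettersOnly = Text.filter (\c -> isAlpha c || isSeparator c) lowered

split, noStops, known, cleaned :: [Text]
split   = Text.words lettersOnly
noStops = split \\ sampleStopWords
known   = filter (`elem` sampleLanguageWords) noStops
cleaned = filter (\w -> Text.length w > 2) known

main :: IO ()
main = do
  print split    -- prints ["notebook","with","pen","paw","patrol"]
  print cleaned  -- prints ["notebook","pen"]
  -- (\\) removes only the first occurrence of each stopword:
  print (["the", "pen", "the"] \\ sampleStopWords)  -- prints ["pen","the"]
  -- A filter removes every occurrence:
  print (filter (`notElem` sampleStopWords) ["the", "pen", "the"])  -- prints ["pen"]
```

The last two lines show a detail worth knowing: if a name can contain the same stopword twice, a filter with notElem may be the safer choice than \\.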

Pretty straightforward, right?

Now let’s get all the vocabulary from the list of those word lists:

vocabulary :: [[Text]] -> [Text]
vocabulary wordList =
 wordList
 |> concat
 |> nub

We take the list of lists, flatten it using concat and remove all duplicate elements using nub .
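A quick sketch of what that does on a small, invented input. It uses the same logic as vocabulary, written without Flow and with String standing in for Text so that it needs only base.

```haskell
import Data.List (nub)

-- Same logic as vocabulary above: flatten, then drop duplicates.
-- String stands in for Text here (an assumption for brevity).
vocabulary :: [[String]] -> [String]
vocabulary = nub . concat

main :: IO ()
main = print (vocabulary [["notebook", "pen"], ["pen", "shirt"], ["notebook"]])
-- prints ["notebook","pen","shirt"]
```

nub keeps the first occurrence of each word and compares every element against the ones already kept, so it takes quadratic time. That is fine for a small vocabulary like ours.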

Let’s count all of this now:

occurrencesOf :: Text -> [Text] -> Int
occurrencesOf word txt =
 txt
 |> filter (== word)
 |> length
 
countedVocabulary :: [[Text]] -> [(Text, Int)]
countedVocabulary s =
  vocabulary s
    |> map appearanceTimes
    |> sort
  where
    -- Pair each word with the number of times it appears overall.
    appearanceTimes x = (x, occurrencesOf x (concat s))

We build a vocabulary from the list of lists and turn each word into a tuple whose second element tells us how many times that word appears, e.g. ("notebook", 14). Finally, we sort the result for convenience.
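countedVocabulary rescans the whole word list once per vocabulary entry, which is fine for our small data set. For larger inputs, one alternative is to count everything in a single pass with Data.Map from the containers package. This is a sketch of that alternative, not what the code above does; the name countedVocabularyMap is invented here, and String stands in for Text.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical alternative: count every word in one pass.
-- Map.toList returns the pairs in ascending key order, so no
-- separate sort is needed.
countedVocabularyMap :: [[String]] -> [(String, Int)]
countedVocabularyMap wordLists =
  Map.toList (Map.fromListWith (+) [(w, 1) | w <- concat wordLists])

main :: IO ()
main = print (countedVocabularyMap [["notebook", "pen"], ["pen", "shirt"], ["pen"]])
-- prints [("notebook",1),("pen",3),("shirt",1)]
```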

In order to get the stopwords and language words we have to load them as lists:

fileAsList :: FilePath -> IO [Text]
fileAsList path = do
 fileContents <- Text.readFile path
 return (Text.words fileContents)
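Because Text.words splits on any whitespace, the word files can hold one word per line or several per line. A minimal sketch, assuming a throwaway file name (stopwords-sample.txt) that the example writes itself so that it can run on its own:

```haskell
import qualified Data.Text as Text
import qualified Data.Text.IO as Text

-- Same behaviour as fileAsList above, written with fmap.
fileAsList :: FilePath -> IO [Text.Text]
fileAsList path = fmap Text.words (Text.readFile path)

main :: IO ()
main = do
  -- Hypothetical sample file, created here only for the example.
  writeFile "stopwords-sample.txt" "a an\nthe\n"
  ws <- fileAsList "stopwords-sample.txt"
  print ws  -- prints ["a","an","the"]
```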

Finally we can write our main function so we can execute our code:

main :: IO ()
main = do
  salesFrame <- loadSales
  -- namesFromFrame is a pure function, so we bind its result with let.
  let salesNames = namesFromFrame salesFrame
  stopWords <- fileAsList "resources/stopwords.txt"
  languageWords <- fileAsList "resources/language-words.txt"
  let cleanedData = map (cleanNameIntoWords languageWords stopWords) salesNames
  print (countedVocabulary cleanedData)

When executed, it will print something like:

[ ("notebook", 15)
 , ("pen", 19)
 , ("shirt", 44)
 , …
 ]

We use Haskell because it’s a powerful language with powerful tools. We know that it is not widely adopted for data science yet, but we are working hard on it, so that could become a reality soon. If you want to help with this task, let’s talk about it.

What we saw here is just a grain of sand in the desert, and Haskell can do much more powerful things. For example, we could move numeric computations to the GPU with relatively small changes to our code (see the Accelerate library). We can also use markdown to write our code, as you can do in Haskell.do, and many more incredible things.

What we cannot ignore is the fact that functional programming is becoming much more important nowadays, and Haskell is one of the best of those languages.

In JavaScript, you feel like an expert, but then, nothing works.
In Haskell, you feel like an amateur, but then, everything works.

Thanks for reading this article — stay tuned for more!

Written by Nick Tchayka