Note: Sets, Maps, and Sequences

Lists are often thought of as the defining data structure of functional programs.

Of course, just because certain useful data structures can be implemented via lists doesn't mean that they should be. Let's consider a simple problem. We have two text files, each consisting of about 100K telephone numbers in standard North American format, (xxx) xxx-xxxx. We want to know which numbers appear in both. A naïve solution is simply to read in both files, and then search the second file for occurrences of each of the phone numbers in the first. This works, but it's slow: about 7 1/2 minutes on my iMac.
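For concreteness, the naïve approach might be sketched as follows (assuming one number per line in each file; this quadratic search is what takes the 7 1/2 minutes):

module Main where

import System.Environment (getArgs)

main :: IO ()
main = do
  [arg1,arg2] <- getArgs
  ph1 <- lines <$> readFile arg1
  ph2 <- lines <$> readFile arg2
  -- for each number in the first file, scan the entire second file
  putStr . unlines $ filter (`elem` ph2) ph1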

The inefficiency is algorithmic — unordered lists make operations like intersection inefficient. A fairly simple change is to sort the lists first, and then compute their intersection more efficiently:

orderedIntersect :: Ord a => [a] -> [a] -> [a]
orderedIntersect [] _ = []
orderedIntersect _ [] = []
orderedIntersect (a:as) (b:bs) = case compare a b of
  LT -> orderedIntersect as (b:bs)
  EQ -> a : orderedIntersect as bs
  GT -> orderedIntersect (a:as) bs

This works, and runs in little more than a second. It's a large win for a small amount of code.

But this isn't especially pretty, and there's an issue lurking in the weeds here. Consider our (simple) main function:

main :: IO ()
main = do
  [arg1,arg2] <- getArgs
  ph1 <- sort . lines <$> readFile arg1
  ph2 <- sort . lines <$> readFile arg2
  putStr . unlines $ orderedIntersect ph1 ph2

Our intent is that the lists processed by orderedIntersect are ordered. But nothing about the type of orderedIntersect enforces this constraint. What if we forgot to sort? The good news is that the program runs much faster; we're not sorting, after all. The bad news is that it produces no output. The worse news is that it compiled in the first place. Our type system didn't help us at all. Is there some way to talk about finite, ordered collections? Welcome to the Set type constructor.

Data.Set

The Data.Set module in the containers package defines a Set data type. This type is opaque, but is easy to work with. Set supports conversion to and from [], common operations like union, difference, intersection, etc., and belongs (under appropriate assumptions) to common type classes like Eq, Ord, Foldable, and Monoid.
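As a quick taste of the basic operations (a sketch using small Int sets):

import Data.Set (Set)
import qualified Data.Set as Set

s1, s2 :: Set Int
s1 = Set.fromList [1,2,3]
s2 = Set.fromList [2,3,4]

-- Set.union s1 s2        == Set.fromList [1,2,3,4]
-- Set.intersection s1 s2 == Set.fromList [2,3]
-- Set.difference s1 s2   == Set.fromList [1]
-- Set.member 2 s1        == True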

In the case of our phone-number intersection problem, we have the following complete solution:

module Main where

import Data.Set (Set)
import qualified Data.Set as Set
import System.Environment (getArgs)

main :: IO ()
main = do
  [arg1,arg2] <- getArgs
  ph1 <- Set.fromList . lines <$> readFile arg1
  ph2 <- Set.fromList . lines <$> readFile arg2
  putStr . unlines . Set.toList $ ph1 `Set.intersection` ph2

A mildly disturbing aspect of this code is the swizzling between Set- and []-based types via Set.toList and Set.fromList. We can often avoid much of this by using the Foldable instance of Set, e.g., the last line could be written

putStr . foldMap (++"\n") $ ph1 `Set.intersection` ph2

relying on the fact that the <> operation for String is (++). Such opportunities become clearer with practice and intention: consider each place where we convert from one container representation to another, and look for alternatives to the conversion.

The complexity-theoretic aspects of what's happening here are worth considering. Our original implementation compared each element of one list with each element of the other. If the first list has $m$ elements, and the second has $n$, the overall complexity is $O(mn)$. The ordered approaches involve either explicit or implicit sorts (which are $O(m \lg m)$ and $O(n \lg n)$ respectively), for an overall complexity of $O(m \lg m + n \lg n)$. If we assume $m = n = 10^5$, we're looking at roughly $10^6$ comparisons for the ordered code, vs. $10^{10}$ for the unordered code. Factors of $10,000$ matter. The underlying Set implementation is based on balanced, ordered, binary trees, in which the height of a tree with $n$ elements is $O(\lg n)$. [Note here that $\lg$ is the base-2 logarithm. Of course, if a log function lives under a big-O, the base doesn't matter, so the use of $\lg$ rather than $\log$ or $\ln$ is pedantry rather than precision.]

An important note: Haskell sets, like mathematical sets, don't account for multiplicity. Elements either belong or they don't. Thus, sorting a list isn't the same as Set.toList . Set.fromList, as the latter suppresses duplicates.
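A small example makes the difference concrete:

sort [3,1,3,2]                      == [1,2,3,3]   -- duplicates preserved
Set.toList (Set.fromList [3,1,3,2]) == [1,2,3]     -- duplicates suppressed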

The specific form of the import statements here is worth noting. The functions in the containers package often have names that clash with Prelude functions (e.g., filter, map, null), and using qualified names is usually the best way to avoid ambiguity, hence the qualified import of Data.Set. That said, having to write the Set type constructor as Set.Set is annoying, and lends nothing to clarity. Since the name Set doesn't clash with anything else, an unqualified import of that specific name creates no ambiguity, and is preferred.

Data.Map

Abstractly, a map is a finite, partial function from a domain type of keys to a range type of values. A common implementation of maps is via an association list:

type AssociationList k v = [(k,v)]

Function application in this setting is often handled via the Prelude function lookup :: Eq a => a -> [(a,b)] -> Maybe b, where the Maybe in the result type enables exception-free handling of partiality. The presence of lookup in the Prelude hints at how often maps are used in real-world programming, and how often they can be adequately represented by an association list.
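A small illustration, with made-up entries:

-- a hypothetical association list from names to phone numbers
phoneBook :: AssociationList String String
phoneBook = [("Alice","(212) 555-0100"), ("Bob","(312) 555-0199")]

-- lookup "Alice" phoneBook == Just "(212) 555-0100"
-- lookup "Carol" phoneBook == Nothing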

Somewhat farther afield, in other programming languages, maps are often called dictionaries, and are a core abstraction of those languages. Indeed, in Python, not only are dictionaries an important data structure in their own right, the objects of the language are essentially just dictionaries with an alternative syntax, and can be thought of as a map from field names to values. Object-oriented programming is largely an exercise in implicit dictionary manipulation, and so its techniques can be carried over to any language with decent support for dictionaries at the cost of some syntactic noise.

As an implementation of maps, association lists are subject to much the same efficiency (or perhaps more accurately, inefficiency) considerations as an unordered set representation, and it's useful to think of maps as being essentially sets of their keys, in which the values are carried along for the ride. This suggests the possibility of more efficient representations, e.g., ordering the list by keys, or even a balanced binary-tree representation. The latter is provided by the Map type of Data.Map.

Like Set, Map is a type constructor, whose types belong to a number of standard type classes, assuming appropriate key and value types, e.g., Eq, Ord, Functor, Foldable, and Traversable. Much like Set, there's a tendency when starting out to rely overly much on conversions between Map values and association lists, via Map.fromList and Map.toList. With practice and intention, as with Set, such conversions can often be avoided by appropriate use of the Functor, Foldable, and Traversable instances.

Let's consider a simple example, a word counting program:

module Main where

import Data.Char (isAlpha)
import Data.Foldable (for_)
import Data.Map (Map)
import qualified Data.Map as Map
import Text.Printf (printf)

clean :: Char -> Char
clean c
  | isAlpha c = c
  | otherwise = ' '

counts :: [String] -> Map String Int
counts = foldr f Map.empty
  where
    f w m = case Map.lookup w m of
      Nothing -> Map.insert w 1 m
      Just c  -> Map.insert w (c+1) m

main :: IO ()
main = do
  ws <- words . map clean <$> getContents
  for_ (Map.toList . counts $ ws) $ \(k,v) ->
    putStrLn $ printf "%-20s %5d" k v

The idea is that we'll process standard input, returning a two-column output of (word,frequency) pairs. I hasten to add that this is very naïve code, and we'll see some significant improvements soon.

The first line of main reads stdin, returning a String. We then map over that String, sending all non-letter characters to a space, and then use words to break the result into a list of String, each a non-empty run of letters. We then do a foldr over the resulting list of words: starting from the empty map, we increment the count of each word, adding the word to the map if it has not previously occurred.

This is a fairly direct translation of what might be written in a language like Python.

A first simplification is in noting that Data.Map has a findWithDefault function, which we can use to eliminate the pattern match in counts.f:

counts :: [String] -> Map String Int
counts = foldr f Map.empty
  where f w m = Map.insert w (Map.findWithDefault 0 w m + 1) m

This code might be criticized on the basis that we have to process m twice to increment a word's count: once to look up the value, and a second time to set it. This double processing can be eliminated using Map.alter, but there is a better way.
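For reference, a sketch of the Map.alter version; the update function receives Nothing when the key is absent, and Just the current value otherwise:

counts :: [String] -> Map String Int
counts = foldr (Map.alter bump) Map.empty
  where bump Nothing  = Just 1       -- first occurrence of the word
        bump (Just c) = Just (c+1)   -- increment an existing count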

Data.Map contains a number of functions for combining maps. The default, union (or its infix equivalent <>), uses left-biased selection, i.e., when a key occurs in both maps, the combined map takes the value from the left-hand argument. This doesn't help us. But unionWith does: when a key occurs in only one of the maps, its value is used, and when it occurs in both, the two values are combined via a user-specified function.
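For instance:

-- "b" occurs in both maps, so its values are combined with (+);
-- "a" and "c" occur in only one map each, so their values pass through
Map.unionWith (+) (Map.fromList [("a",1),("b",2)]) (Map.fromList [("b",3),("c",4)])
  == Map.fromList [("a",1),("b",5),("c",4)]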

This enables a rather different, and to my mind significantly cleaner and more idiomatic, approach:

fmt :: String -> Int -> String
fmt k v = printf "%-20s %5d\n" k v

main :: IO ()
main = do
  ws <- words . map clean <$> getContents
  let m = Map.unionsWith (+) . map (`Map.singleton` 1) $ ws
  putStr . Map.foldMapWithKey fmt $ m

We build a list of words as before, but rather than iterating through them one-by-one, we map them to a list of singleton maps, in which each word has an associated value (count) of one. Then this list of maps is combined into a single map using Map.unionsWith (+), eliminating the need for foldr. Note the types here:

unionWith  :: Ord k => (v -> v -> v) -> Map k v -> Map k v -> Map k v
unionsWith :: Ord k => (v -> v -> v) -> [Map k v] -> Map k v

If we wanted to, we could go “golfing” from here, i.e., convert this to a one-line program by eliminating the variables ws and m (at the cost of introducing a >>=), but that's just showing off. What we have here is clean, and easy to maintain. The fmt function deserves a note. It would be nice to eliminate this definition, but we quickly run into the problem that printf is so polymorphic that the type checker is easily confused. We could resign ourselves to fighting the type checker, but why? The definition does no harm.
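Just to make the point concrete, the golfed version would look something like this (a sketch, not a recommendation):

main :: IO ()
main = getContents >>= putStr . Map.foldMapWithKey fmt . Map.unionsWith (+) . map (`Map.singleton` 1) . words . map clean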

Exercise 1 Listing word counts alphabetically is sometimes (but not usually) what we want. Our interest is usually in finding frequent (or infrequent) words, and this argues for sorting the output from high-to-low frequency. Modify this program to do so.

*Exercise 2 Consider the problem of trying to find a specific, partially remembered quotation in a printed book. These days, in a world where books are often available in machine-readable form, we might just search for a key word or phrase. But you can't do that with a printed book.

A number of tools were invented over the years to facilitate such searches: tables of contents and figures, indexes of various sorts.

Write a concordance program. This program should read a text file from stdin, and print a table (in alphabetical order) of words, together with the lines in which they occur.

A natural data structure to use is

type Concordance = Map String (Set Int)

as the use of Set rather than [] automatically handles the suppression of duplicates.

Include a test run of your program acting on its own source.

It is worth noting two alternative types. The type IntMap in the Data.IntMap module is specialized to keys of type Int, and offers better performance than Map in that case. The type HashMap in Data.HashMap.Lazy (or Data.HashMap.Strict) in the unordered-containers package offers IntMap-like performance when the key type belongs to the Hashable type class. Note that IntSet and HashSet are also available.
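A minimal sketch (assuming the containers and unordered-containers packages are installed):

import qualified Data.IntMap as IntMap
import qualified Data.HashMap.Lazy as HashMap

-- IntMap: Int keys, stored in a radix tree
squares :: IntMap.IntMap Int
squares = IntMap.fromList [(n, n*n) | n <- [1..5]]

-- HashMap: any key type with a Hashable instance
colors :: HashMap.HashMap String String
colors = HashMap.fromList [("red","#ff0000"), ("green","#00ff00")]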

Data.Sequence

The [] type is extremely useful, and it's efficient if we want to build a list from back to front, process the first few elements of a list, or process an entire list from back to front or front to back. Unfortunately, there are things that ordinary lists are inefficient at: building a list from front to back, processing the last few elements of a list, or concatenating two or more lists. Various programming tricks have evolved to deal with these issues, e.g., using function composition rather than prompt construction of a list to allow mixed front-to-back and back-to-front construction. These ideas are often elegant, but a bit opaque. What is needed is a list-like data structure that provides efficient support for a broader range of operations. Welcome to the Seq data type in Data.Sequence.

Note that Seq belongs to all of the expected type classes for a list-like type: Functor, Applicative, Alternative, Monad, MonadFail, MonadPlus, Foldable, Traversable, Semigroup, and Monoid. Whew! But also, very nice. This means that converting list-based code to Seq-based code is often very easy, especially if type-class-based operations are used rather than list-specific ones, e.g., (<>) rather than (++).
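A small sketch of Seq in action; cons, snoc, and concatenation are all cheap:

import Data.Foldable (toList)
import Data.Sequence (Seq, (<|), (|>), (><))
import qualified Data.Sequence as Seq

example :: Seq Int
example = (0 <| Seq.fromList [1,2,3]) |> 4        -- (<|) and (|>) are both O(1)

main :: IO ()
main = print . toList $ example >< Seq.fromList [5,6]   -- (><) is O(log(min(m,n)))
-- prints [0,1,2,3,4,5,6]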