Note: Sets, Maps, and Sequences
Lists are often thought of as the defining data structure of functional programming:
- The first commonly used functional programming language was Lisp, whose name abbreviates "List Processor,"
- the recursive structure of lists facilitates functional approaches to algorithms, and
- many useful data structures can be implemented via lists.
Of course, just because certain useful data structures can be implemented via lists doesn't mean that they should be. Let's consider a simple problem. We have two text files, each consisting of about 100K telephone numbers in standard North American format, (xxx) xxx-xxxx. We want to know which numbers appear in both. A naïve solution is simply to read in both files, and then search the second file for occurrences of each of the phone numbers in the first. This works, but it's slow: about 7 1/2 minutes on my iMac.
The inefficiency is algorithmic — unordered lists make operations like intersection inefficient. A fairly simple change is to sort the lists first, and then compute their intersection more efficiently:
orderedIntersect :: Ord a => [a] -> [a] -> [a]
orderedIntersect [] _ = []
orderedIntersect _ [] = []
orderedIntersect (a:as) (b:bs) = case compare a b of
  LT -> orderedIntersect as (b:bs)
  EQ -> a : orderedIntersect as bs
  GT -> orderedIntersect (a:as) bs
This works, and runs in little more than a second. It's a large win for a small amount of code.
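On small inputs, the merge-like behavior is easy to check directly. Here's a self-contained sketch, repeating the definition above with illustrative inputs of my own choosing:

```haskell
-- Intersect two *sorted* lists in a single merge-like pass.
orderedIntersect :: Ord a => [a] -> [a] -> [a]
orderedIntersect [] _ = []
orderedIntersect _ [] = []
orderedIntersect (a:as) (b:bs) = case compare a b of
  LT -> orderedIntersect as (b:bs)   -- a can't occur in b:bs; drop it
  EQ -> a : orderedIntersect as bs   -- common element; keep it
  GT -> orderedIntersect (a:as) bs   -- b can't occur in a:as; drop it

main :: IO ()
main = print $ orderedIntersect [1,3,5,7::Int] [2,3,4,7,8]  -- [3,7]
```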
But this isn't especially pretty, and there's an issue lurking in the weeds here. Consider our (simple) main function:
main :: IO ()
main = do
  [arg1,arg2] <- getArgs
  ph1 <- sort . lines <$> readFile arg1
  ph2 <- sort . lines <$> readFile arg2
  putStr . unlines $ orderedIntersect ph1 ph2
Our intent is that the lists processed by orderedIntersect are ordered. But there's nothing about the type of orderedIntersect that enforces this constraint. What if we forgot? The good news is that it runs much faster. We're not sorting, after all. The bad news is that it produces no output. The worse news is that it compiled in the first place. Our type system didn't help us at all. Is there some way to talk about finite, ordered collections? Welcome to the Set type constructor.
Data.Set
The Data.Set module in the containers package defines a Set data type. This type is opaque, but is easy to work with. Set supports conversion to and from [], common operations like union, difference, intersection, etc., and belongs (under appropriate assumptions) to common type classes like Eq, Ord, Foldable, and Monoid.
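As a quick sketch of these operations (the sets here are illustrative examples of my own):

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

evens, smalls :: Set Int
evens  = Set.fromList [0,2,4,6,8]
smalls = Set.fromList [0,1,2,3,4]

main :: IO ()
main = do
  print $ Set.toList (evens `Set.intersection` smalls)  -- [0,2,4]
  print $ Set.member 3 evens                            -- False
  print $ Set.toList (evens `Set.union` smalls)         -- [0,1,2,3,4,6,8]
```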
In the case of our phone-number intersection problem, we have the following complete solution:
module Main where

import Data.Set (Set)
import qualified Data.Set as Set
import System.Environment ( getArgs )

main :: IO ()
main = do
  [arg1,arg2] <- getArgs
  ph1 <- Set.fromList . lines <$> readFile arg1
  ph2 <- Set.fromList . lines <$> readFile arg2
  putStr . unlines . Set.toList $ ph1 `Set.intersection` ph2
A mildly disturbing aspect of this code is the swizzling between Set- and []-based types, via Set.toList and Set.fromList. We can often avoid much of this by using the Foldable instance of Set, e.g., the last line could be written

putStr . foldMap (++"\n") $ ph1 `Set.intersection` ph2

relying on the fact that the <> operation for String is (++). Such opportunities become clear with both practice and intention: consider each of the places where we convert from one container representation to another, and seek alternatives to conversion.
The complexity-theoretic aspects of what's happening here are worth considering. Our original implementation compared each element of one list with each element of another. If the first list has $m$ elements, and the second has $n$, the overall complexity is $O(mn)$. The ordered approaches involved either explicit or implicit sorts (which are $O(m \lg m)$ and $O(n \lg n)$), for an overall complexity of $O(m \lg m + n \lg n)$. If we assume $m=n=10^5$, we're looking at roughly $10^6$ comparisons for the ordered code, vs. $10^{10}$ for the unordered code. Factors of $10,000$ matter. The underlying implementation is based on balanced, ordered, binary trees, in which the height of a tree with $n$ elements is $O(\lg n)$. [Note here that $\lg$ is the base-2 logarithm. Of course, if a log function lives under a big-O, the base doesn't matter, so the use of $\lg$ rather than $\log$ or $\ln$ is pedantic rather than precise.]
An important note: Haskell sets, like mathematical sets, don't account for multiplicity. Elements either belong or they don't. Thus, sorting a list isn't the same as Set.toList . Set.fromList, as the latter suppresses duplicates.
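For instance (a small check, with illustrative inputs):

```haskell
import Data.List (sort)
import qualified Data.Set as Set

main :: IO ()
main = do
  print $ sort [3,1,3,2::Int]                       -- [1,2,3,3]: duplicates kept
  print $ Set.toList (Set.fromList [3,1,3,2::Int])  -- [1,2,3]: duplicates suppressed
```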
The specific form of the import statements here is worth noting. The types in the containers package often have names that clash with Prelude functions, and using qualified names is usually the best way to avoid ambiguity, hence the qualified import of Data.Set. That said, having to write the Set type constructor as Set.Set is annoying, and lends nothing to clarity. Since it does not clash with other names, an unqualified import of that specific name doesn't create ambiguity, and is preferred.
Data.Map
Abstractly, a map is a finite, partial function from a domain type of keys to a range type of values. A common implementation of maps is via an association list:

type AssociationList k v = [(k,v)]

Function application in this setting is often handled using the Prelude function lookup :: Eq a => a -> [(a,b)] -> Maybe b, where the Maybe wrapper in the range enables exception-free handling of non-totality. The presence of lookup in the Prelude hints at how often maps are used in real-world programming, and how often they can be adequately represented via an association list.
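A minimal sketch, with a hypothetical association list of my own devising:

```haskell
-- lookup comes straight from the Prelude; no imports needed.
phones :: [(String, String)]
phones = [("Alice", "(312) 555-0123"), ("Bob", "(773) 555-0188")]

main :: IO ()
main = do
  print $ lookup "Alice" phones  -- Just "(312) 555-0123"
  print $ lookup "Carol" phones  -- Nothing: absent keys are handled without exceptions
```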
Somewhat farther afield, in other programming languages, maps are often called dictionaries, and are a core abstraction of those languages. Indeed, in Python, not only are dictionaries an important data structure in their own right, the objects of the language are essentially just dictionaries with an alternative syntax, and can be thought of as maps from field names to values. Object-oriented programming is largely an exercise in implicit dictionary manipulation, and so its techniques can be carried over to any language with decent support for dictionaries, at the cost of some syntactic noise.
As an implementation of maps, association lists are subject to very much the same sort of efficiency (or perhaps more accurately, inefficiency) considerations as an unordered set representation, and it's useful to think of maps as being essentially sets of their keys, in which the values are carried along for the ride. This implies the possibility of more efficient representations, e.g., ordering the list by keys, or even a balanced binary-tree representation. The latter is provided by the Map type of Data.Map.
Like Set, Map is a type constructor, whose types belong to a number of standard type classes, assuming appropriate key and value types, e.g., Eq, Ord, Functor, Foldable, and Traversable. Much like Set, there's a tendency when starting out to rely overly much on conversions between Map values and association lists, via Map.fromList and Map.toList. With practice and intention, as with Set, such conversions can often be avoided by appropriate use of the Functor, Foldable, and Traversable instances.
Let's consider a simple example, a word counting program:
module Main where

import Data.Char ( isAlpha )
import Data.Foldable ( for_ )
import Data.Map ( Map )
import qualified Data.Map as Map
import Text.Printf ( printf )

clean :: Char -> Char
clean c
  | isAlpha c = c
  | otherwise = ' '

counts :: [String] -> Map String Int
counts = foldr f Map.empty where
  f w m = case Map.lookup w m of
    Nothing -> Map.insert w 1 m
    Just c  -> Map.insert w (c+1) m

main :: IO ()
main = do
  ws <- words . map clean <$> getContents
  for_ (Map.toList . counts $ ws) $ \(k,v) ->
    putStrLn $ printf "%-20s %5d" k v
The idea is that we'll process standard input, returning a two-column output of (word, frequency) pairs. I hasten to add that this is very naïve code, and we'll see some significant improvements soon.

The first line of main reads stdin, returning a String. We then map over that String, sending all non-letter characters to a space, and then use words to break the result into a list of Strings, each a non-empty run of letters. We then do a foldr over the resulting list of words, and starting from the empty map, we increment the count of each word, adding the word to the map if it has not previously occurred. This is a fairly direct translation of what might be written in a language like Python.
A first simplification is in noting that Data.Map has a findWithDefault function, which we can use to eliminate the pattern match in counts.f:
counts :: [String] -> Map String Int
counts = foldr f Map.empty where
  f w m = Map.insert w (Map.findWithDefault 0 w m + 1) m
This code might be criticized on the basis that we have to process m twice to increment a word's count: once to look up the value, and a second time to set it. This double processing can be eliminated using Map.alter, but there is a better way.
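For the curious, a single-pass increment via Map.alter might look like this; the helper name bump is my own:

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- Map.alter hands the supplied function Maybe the old value, and
-- inserts/updates/deletes according to the Maybe it returns, all in
-- one traversal of the map.
bump :: String -> Map String Int -> Map String Int
bump = Map.alter (Just . maybe 1 (+1))

main :: IO ()
main = print $ Map.toList (bump "a" (bump "b" (bump "a" Map.empty)))  -- [("a",2),("b",1)]
```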
Data.Map contains a number of functions for combining maps. The default method, associated with union or its infix equivalent <>, is left-biased selection, i.e., the combined map will return the value from the left-hand argument (if present), and if not, from the right-hand argument. This doesn't help us. But unionWith does help us: if a key is present in only one of the two maps, its value is used, but if it is present in both, the two values are combined via a user-specified function.
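The difference between the left-biased union and unionWith shows up on a small example (the maps here are illustrative):

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

m1, m2 :: Map String Int
m1 = Map.fromList [("a",1),("b",2)]
m2 = Map.fromList [("b",10),("c",3)]

main :: IO ()
main = do
  print $ Map.toList (m1 `Map.union` m2)        -- [("a",1),("b",2),("c",3)]: "b" taken from m1
  print $ Map.toList (Map.unionWith (+) m1 m2)  -- [("a",1),("b",12),("c",3)]: "b" values combined
```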
This enables a rather different, and to my mind significantly cleaner and more idiomatic, approach:
fmt :: String -> Int -> String
fmt k v = printf "%-20s %5d\n" k v

main :: IO ()
main = do
  ws <- words . map clean <$> getContents
  let m = Map.unionsWith (+) . map (`Map.singleton` 1) $ ws
  putStr . Map.foldMapWithKey fmt $ m
We build a list of words as before, but rather than iterating through them one-by-one, we map them to a list of singleton maps, in which each word has an associated value (count) of one. This list of maps is then combined into a single map using Map.unionsWith (+), eliminating the need for foldr. Note the types here:
unionWith :: (Ord k) => (v -> v -> v) -> Map k v -> Map k v -> Map k v
unionsWith :: Ord k => (v -> v -> v) -> [Map k v] -> Map k v
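The word-counting core can be checked in isolation; here is a small sketch packaging the same idea as a counts function:

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- One singleton map per word, then merge them all, adding counts.
counts :: [String] -> Map String Int
counts = Map.unionsWith (+) . map (`Map.singleton` 1)

main :: IO ()
main = print $ Map.toList (counts (words "a b a c a b"))  -- [("a",3),("b",2),("c",1)]
```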
If we wanted to, we could go “golfing” from here, i.e., converting this to a one-line program by eliminating the variables ws and m (at the cost of introducing a >>=), but that's just showing off. What we have here is clean, and easy to maintain. The fmt function should be noted here. It would be nice to eliminate this definition, but we quickly run into the problem that printf is so polymorphic that the type checker is easily confused. We could resign ourselves to fighting the type checker, but why? The definition does no harm.
Exercise 1 Listing word counts alphabetically is sometimes (but not usually) what we want. Our interest is usually in finding frequent (or infrequent) words, and this argues for sorting the output from high-to-low frequency. Modify this program to do so.
*Exercise 2 Consider the problem of trying to find a specific, partially remembered, quotation in a printed book. These days, in a world where books are often available in machine-readable form, we might just search for a key word or phrase. But you can't do that with a printed book. A number of tools were invented over the years to facilitate such searches: tables of contents and figures, indexes of various sorts.

Write a concordance program. This program should read a text file from stdin, and print a table (in alphabetical order) of words, together with the lines in which they occur.
A natural data structure to use is

type Concordance = Map String (Set Int)

as the use of Set rather than [] automatically handles the suppression of duplicates.
Include a test run of your program acting on its own source.
It is worth noting two alternative types. The type IntMap in the Data.IntMap module is specialized to work with keys of type Int, and offers better performance than Map in that case. The type HashMap in Data.HashMap.Lazy (or Data.HashMap.Strict) in the unordered-containers package offers IntMap-like performance when the key type is a member of the Hashable type class. Note that IntSet and HashSet are also available.
Data.Sequence
The [] type is extremely useful, and it's efficient if we want to build a list from back to front, process the first few elements of a list, or process an entire list from back to front or front to back. Unfortunately, there are things that ordinary lists can be inefficient at: building a list from front to back, processing the last few elements of a list, or concatenating two or more lists into a single list. Various programming tricks have evolved to deal with these issues, e.g., using functional composition rather than prompt construction of a list to allow mixed front-to-back and back-to-front construction. These ideas are often elegant, but a bit opaque. What is needed is a list-like data structure that provides efficient support for a broader range of operations. Welcome to the Seq data type in Data.Sequence.
Note that Seq belongs to all of the expected type classes for a list-like type: Functor, Applicative, Alternative, Monad, MonadFail, MonadPlus, Foldable, Traversable, Semigroup, and Monoid. Whew! But also, very nice. This means that converting list-based code to Seq-based code is often very easy, especially if type-class-based operations are used rather than list-specific ones, e.g., (<>) rather than (++).
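A small sketch of the operations that make Seq attractive, (<|) for efficient cons at the front and (|>) for efficient snoc at the back (the sample sequence is my own):

```haskell
import Data.Foldable (toList)
import Data.Sequence (Seq, (<|), (|>))
import qualified Data.Sequence as Seq

s :: Seq Int
s = Seq.fromList [2,3,4]

main :: IO ()
main = do
  print $ toList (1 <| s)                   -- [1,2,3,4]: add at the front
  print $ toList (s |> 5)                   -- [2,3,4,5]: add at the back
  print $ toList (s <> Seq.fromList [5,6])  -- [2,3,4,5,6]: efficient concatenation
```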