Lecture 26

Standard Haskell Data Structures

We're going to do a brief survey of some of the data structures provided by the Haskell Platform.

Data.Sequence

[] is wonderful, but sometimes we want a list-like data structure with different performance characteristics, e.g., efficient access to both ends and efficient concatenation. If we can accept the restrictions of finiteness and strict underlying operations, Data.Sequence provides this.
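A small sketch of a few of the Data.Sequence operators (the names example, combined, and leftmost are just for illustration):

import Data.Sequence (Seq, (<|), (|>), (><), ViewL(..))
import qualified Data.Sequence as Seq

-- cons on the left, snoc on the right
example :: Seq Int
example = (0 <| Seq.fromList [1,2,3]) |> 4

-- concatenation is O(log(min(n,m))), not O(n) as for lists
combined :: Seq Int
combined = example >< Seq.fromList [5,6]

-- inspect the left end with a view
leftmost :: Maybe Int
leftmost = case Seq.viewl combined of
             EmptyL -> Nothing
             x :< _ -> Just x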

Data.ByteString

String is wonderful, but it comes with very large space overheads and associated inefficiencies, especially if we're dealing with strings of ASCII characters on 64-bit machines. A ByteString is a packed array of Word8s, i.e., raw bytes, suitable for high-performance work on large quantities of data. I've used ByteString to handle bioinformatics data-sets containing hundreds of megabytes of ASCII data. You wouldn't want to do this with ordinary Haskell strings, which take 20(!) bytes of memory to represent a single byte of data. There are basically two costs associated with ByteString: first, cons is expensive, as it involves recopying the array; second, non-ASCII characters cannot easily be represented. The default Data.ByteString module adds the further complication that the values of the array are not typed as characters, which often makes Data.ByteString.Char8 more convenient in practice.
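As a rough sketch of the Char8 interface in use (countLines is a made-up name; readFile and lines are from Data.ByteString.Char8):

import qualified Data.ByteString.Char8 as BS

-- Count the lines of a (possibly very large) ASCII file.
-- BS.readFile reads the whole file into one packed buffer, and
-- BS.lines produces slices that share that buffer rather than copying it.
countLines :: FilePath -> IO Int
countLines path = do
  contents <- BS.readFile path
  return (length (BS.lines contents))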

Data.Text

Data.Text provides a more efficient representation of Unicode text than String, albeit at the cost of requiring you to learn the interface.
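A minimal sketch, assuming the usual Data.Text API (pack and toUpper):

import Data.Text (Text)
import qualified Data.Text as T

greeting :: Text
greeting = T.pack "héllo, wörld"   -- full Unicode, in a packed representation

shouted :: Text
shouted = T.toUpper greeting       -- "HÉLLO, WÖRLD"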

Data.Set

Sets are a basic abstraction of mathematics, and a useful data abstraction in programming. The Haskell Platform's Data.Set is a balanced binary tree implementation of sets based on ordering (i.e., the elements of a Set have to be instances of the Ord typeclass). This module is meant to be imported qualified, as many of its functions clash with Prelude functions.

import Data.Set (Set)
import qualified Data.Set as Set

The basic functions for constructing sets are empty, singleton, insert, and fromList.
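A small sketch of these in use (the names none, one, and few are just for illustration):

none :: Set Int
none = Set.empty

one :: Set Int
one = Set.singleton 1

few :: Set Int
few = Set.insert 3 (Set.fromList [1,2])   -- the set {1,2,3}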

Where the power of sets really comes into play is in operations that combine sets, e.g., union, intersection, and difference.
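For example (a sketch; the combining functions are from Data.Set, the other names are illustrative):

evens, threes :: Set Int
evens  = Set.fromList [2,4,6,8,10,12]
threes = Set.fromList [3,6,9,12]

both      = Set.intersection evens threes   -- {6,12}
either'   = Set.union evens threes          -- {2,3,4,6,8,9,10,12}
onlyEvens = Set.difference evens threes     -- {2,4,8,10}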

There are a large number of other operations, e.g., member for element testing, findMin, deleteMin, analogous functions for max, null which tests a set for emptiness, and size which computes the number of elements in a set. A general rule of thumb is that before writing a new Set function, check and make sure that it hasn't already been implemented in Data.Set.
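A few of these at the GHCi prompt, with Data.Set imported qualified as above:

> Set.member 3 (Set.fromList [1,2,3])
True
> Set.size (Set.fromList [1,2,3,2,1])
3
> Set.findMin (Set.fromList [5,2,9])
2
> Set.null Set.empty
True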

One of the disappointments with the Set type is that Set is not a functor or a monad, nor can it be made one without considerable deviousness, because of the constraint that the elements of a Set must be ordered. This is something that people are thinking about.

Functional programmers often use lists where sets might be more appropriate, mostly out of familiarity with lists. Where sets shine over lists or even ordered lists is in efficiency, but this is an asymptotic efficiency, i.e., the overheads of using a more efficient data structure may be disadvantageous at small problem sizes. Still, there are some small, common tasks where using sets can help. Consider the problem of constructing a function uniq :: [a] -> [a] such that every element of the argument appears exactly once in the result. We'll often use the function nub from Data.List to do this, but note that the documentation for nub says that this can require $O(n^2)$ time. Let's look at the implementation (note that the actual implementation defines nub in terms of nubBy, but this is faithful to the algorithm):

nub :: Eq a => [a] -> [a]
nub [] = []
nub (x:xs) = x : nub (filter (/= x) xs)

If we're willing to accept the stronger constraint Ord a, we can obtain much the same effect by simply running the list through a set (although the result comes out in sorted order rather than in order of first appearance):

uniq :: Ord a => [a] -> [a]
uniq = Set.toList . Set.fromList

A somewhat similar example/opportunity can be found in the Proposition class: the function variables :: (Eq a) => Prop a -> [a] practically begs to be rewritten:

newtype Union a = Union { getUnion :: Set.Set a }

instance Ord a => Monoid (Union a) where
  mempty = Union Set.empty
  mappend x y = Union (getUnion x `Set.union` getUnion y)

variables :: Ord a => Prop a -> [a]
variables = Set.toList . getUnion . foldMap (Union . Set.singleton)

And certainly, Union with its Monoid instance definition looks like a very reusable bit of code.

Exercise 26.1

Implement the analogous Intersection type, with its own Monoid instance. Note that this isn't as simple as it might seem, because you need to define mempty.

We will see more powerful uses of sets in the next lecture.

Data.Map

The Map type is very similar to the Set type, and indeed, the Map type can be thought of as a Set type in which the elements carry values with them. We've seen and used maps in various programs this quarter.
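A minimal sketch of the correspondence, using standard Data.Map functions (imported qualified, just like Data.Set; the names ages, bobsAge, and ages' are illustrative):

import qualified Data.Map as Map

ages :: Map.Map String Int
ages = Map.fromList [("alice", 37), ("bob", 42)]

-- Lookup returns a Maybe, since the key may be absent.
bobsAge :: Maybe Int
bobsAge = Map.lookup "bob" ages       -- Just 42

-- Insertion, like Set.insert, builds a new map and leaves the old one intact.
ages' :: Map.Map String Int
ages' = Map.insert "carol" 29 ages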

Data.IntSet, Data.IntMap

There are specialized versions of the Set and Map types for use in the common case where the elements/keys are of type Int. These exploit the fixed width of Int keys (the implementation is a radix tree over the bits of the key), giving substantially better lookup performance than the general ordered-tree versions.
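A quick sketch; the interface deliberately mirrors Data.Map, only with Int keys (counts and two are illustrative names):

import qualified Data.IntMap as IntMap

counts :: IntMap.IntMap String
counts = IntMap.fromList [(0,"zero"), (1,"one"), (2,"two")]

two :: Maybe String
two = IntMap.lookup 2 counts          -- Just "two"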

Data.Array

Surprisingly, Haskell does support an array type, but remembering that Haskell is a pure language, there's an obvious issue associated with assignment, because somehow we have to end up with both the old array and the new one. The effect of this is that Haskell has an array type that supports very efficient build and look-up, but the cost of an assignment is the same as the cost of a build, i.e., linear in array size. This means that naïve translations of array-based algorithms into Haskell can be hopelessly inefficient, but... many array-based algorithms can be re-expressed as iterative builds, i.e., as a sequence of steps in which a single new array is built based on values from the old array. In such cases, the algorithms can be very efficient.
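A sketch of the distinction (setOne and step are illustrative names; (//), listArray, bounds, indices, and (!) are from Data.Array): a single update recopies the whole array, whereas one iterative-build step costs the same but accomplishes a whole pass of the algorithm at once.

import Data.Array

-- A single "assignment" recopies the whole array: O(n).
setOne :: Array Int Int -> Array Int Int
setOne a = a // [(0, 42)]

-- One step of an iterative algorithm: build an entire new array
-- from the old one. Also O(n), but it does n updates' worth of work.
step :: Array Int Int -> Array Int Int
step a = listArray (bounds a) [ a ! i + 1 | i <- indices a ]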

The basic array type is Array i e, where i is the index type, which must belong to the typeclass Ix, and is often Int. Note, though, that Ix is closed under tuples, e.g., there is an instance (Ix a, Ix b) => Ix (a,b), enabling the use of multidimensional arrays.

Note that, unlike in some other programming languages, the bounds of an array are not part of its type.
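For instance, a small two-dimensional array indexed by pairs (a sketch; table and corner are illustrative names):

import Data.Array

-- A 3x3 multiplication table, indexed by (row, column).
-- The bounds ((1,1),(3,3)) are ordinary values, not part of the type.
table :: Array (Int, Int) Int
table = listArray ((1,1),(3,3)) [ r * c | r <- [1..3], c <- [1..3] ]

corner :: Int
corner = table ! (3,3)   -- 9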

One of the places where Haskell arrays really shine is in implementing dynamic programs -- these are programs in which solutions to sub-problems are memoized, avoiding unnecessary recalculation. Consider the problem of counting the number of distinct binary trees with n nodes. This can be attacked directly via a simple recursive program:

countTrees :: Integer -> Integer
countTrees n =
  if n == 0
  then 1
  else sum [ countTrees left * countTrees right
           | left <- [0..n-1]
           , let right = n - left - 1
           ]

There is only one tree with zero nodes, the empty tree. For non-empty trees, after fixing the top node, we consider each of the ways of dividing the remaining nodes between the left and right subtrees, and count each way of combining a distinct left subtree with a distinct right subtree.

The code here is conceptually quite clear, but unfortunately, it's not fast: computing countTrees 15 takes a few seconds on a modern machine, but countTrees 30 would require more than a century. Can we do better?

Our approach will be to fill an Array with values of the function, while using those same values to facilitate the calculation:

import Data.Array

countTreesFast :: Integer -> Integer
countTreesFast n = a ! n where
  a = array (0,n) [(i, ct i) | i <- [0..n]]
  ct n = if n == 0
         then 1
         else sum [ a ! left * a ! right
                  | left <- [0..n-1]
                  , let right = n - left - 1
                  ]

Let's take this apart. The array a has bounds (0,n), and its entries are the values ct i for each i in [0..n]. The helper ct is just countTrees, except that the recursive calls have been replaced by lookups into a. Because Haskell is lazy, each entry of a is computed only when demanded, and at most once; the lookups at smaller indices force those entries first, so every sub-problem is solved exactly once and then reused.

The performance characteristics of countTreesFast are very different from countTrees. We can compute

> countTreesFast 100
896519947090131496687170070074100632420837521538745909320

almost instantaneously, and

> countTreesFast 1000

in a few seconds, obtaining a 598 digit answer.