Lecture 26
Standard Haskell Data Structures
We're going to do a brief survey of some of the data structures provided by the Haskell Platform.
Data.Sequence
[] is wonderful, but sometimes we want a list-like data structure with different characteristics, e.g., efficient access to both ends and efficient concatenation. If we can accept the restrictions of finiteness and strict underlying operations, Data.Sequence provides this.
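As a sketch of the flavor of the interface (these are the actual Data.Sequence operators; the example values are invented), note that this module, too, is meant to be imported qualified:

import Data.Sequence (Seq, (<|), (|>), (><))
import qualified Data.Sequence as Seq

-- (<|) and (|>) add to the front and back in O(1);
-- (><) concatenates in O(log(min(m,n))).
example :: Seq Int
example = (0 <| Seq.fromList [1,2,3]) |> 4    -- fromList [0,1,2,3,4]

longer :: Seq Int
longer = example >< Seq.fromList [5,6]        -- fromList [0,1,2,3,4,5,6]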
Data.ByteString
String is wonderful, but it comes with very large space overheads and associated inefficiencies, especially if we're dealing with strings of ASCII characters on 64-bit machines. A ByteString is a packed array of Word8s, i.e., raw bytes, suitable for high-performance work on large quantities of data. I've used ByteString to handle bioinformatics data-sets containing hundreds of megabytes of ASCII data. You wouldn't want to do this with ordinary Haskell strings, which take 20(!) bytes of memory to represent a single byte of data. There are basically two costs associated with ByteString: first, cons is expensive, as it involves recopying the array; second, non-ASCII characters cannot easily be represented. The default Data.ByteString module adds the further complication that the values of the array are not typed as characters. This often makes Data.ByteString.Char8 more convenient in practice.
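For instance, here is a small sketch of the sort of thing I'd do with a bioinformatics file (the GC-counting task is invented for illustration):

import qualified Data.ByteString.Char8 as BC

-- Hypothetical task: count the G and C bases in a sequence file.
-- BC.readFile slurps the whole file into a single packed byte array.
gcCount :: FilePath -> IO Int
gcCount path = do
  s <- BC.readFile path
  return (BC.count 'G' s + BC.count 'C' s)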
Data.Text
Data.Text provides a more efficient representation of Unicode text than String, albeit at the cost of requiring you to learn the interface.
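A minimal sketch (pack converts from String, and the text operations are Unicode-aware):

import qualified Data.Text as T

shout :: String -> T.Text
shout = T.toUpper . T.pack   -- correct case mapping, even for non-ASCII text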
Data.Set
Sets are the basic abstraction of mathematics, and a useful data abstraction in programming. The Haskell Platform's Data.Set is a balanced binary tree implementation of sets based on order (i.e., the elements of a Set have to be instances of the Ord typeclass). This module is meant to be imported qualified, as many of its functions clash with Prelude functions.
import Data.Set (Set)
import qualified Data.Set as Set
The basic functions for constructing sets are:
- empty :: Set a, the empty set.
- singleton :: Ord a => a -> Set a, create a one-element set.
- insert :: Ord a => a -> Set a -> Set a, create a new set by inserting an element into an existing set.
- delete :: Ord a => a -> Set a -> Set a, create a new set by deleting an element from an existing set.
- fromList :: Ord a => [a] -> Set a, create a set from a list of elements.
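For example, in a GHCi session (with the qualified import above):

> Set.insert 4 (Set.fromList [1,2,3])
fromList [1,2,3,4]
> Set.delete 2 (Set.fromList [1,2,3])
fromList [1,3]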
Where the power of sets really comes into play is operations that combine sets, e.g.,
- union :: Ord a => Set a -> Set a -> Set a, create the union of two sets.
- intersection :: Ord a => Set a -> Set a -> Set a, create the intersection of two sets.
- difference :: Ord a => Set a -> Set a -> Set a, create the difference between two sets. The (\\) operator is an alias for difference.
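These behave just as you'd expect (the example sets here are invented):

> let evens  = Set.fromList [0,2..10]
> let threes = Set.fromList [0,3..10]
> Set.union evens threes
fromList [0,2,3,4,6,8,9,10]
> Set.intersection evens threes
fromList [0,6]
> Set.difference evens threes
fromList [2,4,8,10]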
There are a large number of other operations, e.g., member for element testing; findMin and deleteMin, and the analogous functions for max; null, which tests a set for emptiness; and size, which computes the number of elements in a set. A general rule of thumb is that before writing a new Set function, check and make sure that it hasn't already been implemented in Data.Set.
One of the disappointments with the Set type is that Set is not a functor or a monad, nor can it be without considerable deviousness, because of the constraint that the elements of a Set must be ordered. This is something that people are thinking about.
Functional programmers often use lists where sets might be more appropriate, mostly out of familiarity with lists. Where sets shine over lists or even ordered lists is in efficiency, but this is an asymptotic efficiency, i.e., the overheads of using a more efficient data structure may be disadvantageous at small problem sizes. Still, there are some small, common tasks where using sets can help. Consider the problem of constructing a function uniq :: [a] -> [a] such that every element of the argument appears exactly once in the result. We'll often use the function nub from Data.List to do this, but note that the documentation for nub says that this can require $O(n^2)$ time. Let's look at the implementation (note that the actual implementation defines nub in terms of nubBy, but this is faithful to the algorithm):
nub :: Eq a => [a] -> [a]
nub [] = []
nub (x:xs) = x : nub (filter (/=x) xs)
If we're willing to accept the stronger constraint of Ord a, we obtain much the same effect by simply running the list through a set:
uniq :: Ord a => [a] -> [a]
uniq = Set.toList . Set.fromList
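One behavioral difference worth noting: unlike nub, which preserves the order of first occurrence, this uniq returns its elements in ascending order:

> uniq [3,1,3,2,1]
[1,2,3]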
A somewhat similar example/opportunity can be found in the Proposition class: the function variables :: (Eq a) => Prop a -> [a] practically begs to be re-written:
newtype Union a = Union { getUnion :: Set.Set a }

-- Note: modern GHCs (base >= 4.11) require a Semigroup instance for any Monoid.
instance Ord a => Semigroup (Union a) where
  Union x <> Union y = Union (x `Set.union` y)

instance Ord a => Monoid (Union a) where
  mempty  = Union Set.empty
  mappend = (<>)

variables :: Ord a => Prop a -> [a]
variables = Set.toList . getUnion . foldMap (Union . Set.singleton)
And certainly, Union with its Monoid instance definition looks like a very re-usable bit of code.
Exercise 26.1
Implement the analogous Intersection type, with its own Monoid instance. Note that this isn't as simple as might be expected, because you need to define mempty.
We will see more powerful uses of sets in the next lecture.
Data.Map
The Map type is very similar to the Set type, and indeed, the Map type can be thought of as a Set type in which the elements carry values with them. We've seen and used maps in various programs this quarter.
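To recall the flavor of the interface, here is a small sketch (the word-counting function is invented for illustration); like Data.Set, this module is meant to be imported qualified:

import qualified Data.Map as Map

-- Build a frequency table; fromListWith combines repeated keys with (+).
wordCounts :: [String] -> Map.Map String Int
wordCounts ws = Map.fromListWith (+) [ (w, 1) | w <- ws ]

-- wordCounts (words "to be or not to be")
--   == Map.fromList [("be",2),("not",1),("or",1),("to",2)]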
Data.IntSet, Data.IntMap
There are specialized versions of the Set and Map types for use in the common case where the elements/keys are of type Int. This admits a specialized implementation (big-endian Patricia tries over the bits of the key), with substantially better lookup efficiency than the general-purpose balanced trees.
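The interface mirrors Data.Map; a minimal sketch (the table is invented):

import qualified Data.IntMap as IntMap

squares :: IntMap.IntMap Int
squares = IntMap.fromList [ (i, i*i) | i <- [1..10] ]

-- IntMap.lookup 7 squares == Just 49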
Data.Array
Surprisingly, Haskell does support an array type, but remembering that Haskell is a pure language, there's an obvious issue associated with assignment, because somehow we have to end up with both the old array and the new one. The effect of this is that Haskell has an array type that supports very efficient build and look-up, but the cost of an assignment is the same as the cost of a build, i.e., linear in array size. This means that naïve translations of array-based algorithms into Haskell can be hopelessly inefficient, but... many array-based algorithms can be re-expressed as iterative builds, i.e., as a sequence of steps in which a single new array is built based on values from the old array. In such cases, the algorithms can be very efficient.
The basic array type is Array i e, where i is the index type, which must belong to the typeclass Ix, and is often Int. Note, though, that Ix is closed under tuples, e.g., there is an instance (Ix a, Ix b) => Ix (a,b), enabling the use of multidimensional arrays.
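For example, a two-dimensional table built via the tuple instance (the table itself is invented for illustration):

import Data.Array

-- A 3-by-3 multiplication table indexed by pairs.
multTable :: Array (Int,Int) Int
multTable = array ((1,1),(3,3)) [ ((i,j), i*j) | i <- [1..3], j <- [1..3] ]

-- multTable ! (2,3) == 6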
One of the places where Haskell arrays really shine is in implementing dynamic programming algorithms, i.e., programs in which solutions to sub-problems are memoized, avoiding unnecessary recalculation. Consider the problem of counting the number of distinct binary trees with n nodes. This can be attacked directly via a simple recursive program:
countTrees :: Integer -> Integer
countTrees n =
  if n == 0
    then 1
    else sum [ countTrees left * countTrees right
             | left <- [0..n-1]
             , let right = n - left - 1
             ]
There is only one tree with zero nodes, the empty tree. For non-empty trees, after fixing the top node, we consider each of the ways of dividing the remaining nodes between the left and right subtrees, and count each way of combining a distinct left subtree with a distinct right subtree.
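In symbols, if $T_n$ denotes the number of distinct binary trees with $n$ nodes, the program computes the recurrence $T_0 = 1$ and $T_n = \sum_{k=0}^{n-1} T_k\,T_{n-1-k}$; these are the Catalan numbers.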
The code here is conceptually quite clear, but unfortunately, it's not fast: computing countTrees 15 takes a few seconds on a modern machine, but countTrees 30 would require more than a century. Can we do better?
Our approach will be to fill an Array with values of the function, while using those same values to facilitate the calculation:
import Data.Array

countTreesFast :: Integer -> Integer
countTreesFast n = a ! n where
  a = array (0,n) [ (i, ct i) | i <- [0..n] ]
  ct n = if n == 0
           then 1
           else sum [ a ! left * a ! right
                    | left <- [0..n-1]
                    , let right = n - left - 1
                    ]
Let's take this apart:
- We build an Array using array :: (Ix i) => (i,i) -> [(i,e)] -> Array i e. The first argument (i,i) gives the bounds of the array, inclusive. The second argument [(i,e)] is a list of associations. The associations don't need to be in ascending order, they don't need to be complete (i.e., some indices may be missing, in which case the corresponding element is undefined), and they don't need to be consistent (i.e., the same index may appear multiple times with distinct values, in which case the actual association is implementation-dependent). In this case, we've defined each value precisely once.
- Note that the local ct function has a structure very similar to that of our original countTrees function, except that array lookups replace recursive calls. It is possible to pursue this more systematically, but we won't now.
- The use of a here depends heavily on the fact that we're working with a lazy language. The initial definition associates (evaluated) keys with unevaluated thunks. It is the lookup at the end (a ! n) that drives the evaluation of these thunks.
The performance characteristics of countTreesFast are very different from those of countTrees. We can compute

> countTreesFast 100
896519947090131496687170070074100632420837521538745909320

almost instantaneously, and

> countTreesFast 1000

in a few seconds, obtaining a 598-digit answer.