Lecture 26: The Path Less Taken
Administrivia
- There will be a review session during Professor Kurtz's regular class meeting time on Friday. Come with questions — no new material will be introduced.
- Final Exam (Kurtz): Mon Dec 7, 1:30pm-3:30pm, Ryerson 276
- Final Exam (Chugh): Wed Dec 9, 10:30am-12:30pm, Ryerson 276
As the final lecture of this course, I'd like to note that you stand at a fork in the road. There will be courses in functional programming available to you in the future, and perhaps even a more advanced course in Haskell. But it is possible for you to avoid functional programming, and many of you will choose to do so. We hope that if this is your last experience with functional programming, you'll still have benefited from it, and that it will color the way you think about programming. Certainly, it's the case that traditional languages are in the midst of embracing functional constructs, and what you've learned here almost certainly will give you a leg up going forward.
But today is really for the rest of you: those who've enjoyed the expressiveness and power of functional programming, and who want to pursue the subject further. There are a number of courses offered here that use or build on functional programming, and which may be of interest:
- CMSC 22100 — Programming Languages
- CMSC 22300 — Functional Programming
- CMSC 22311 — Functional Systems in Haskell
- CMSC 22610 — Implementation of Computer Languages, I
- CMSC 22620 — Implementation of Computer Languages, II
There are also many opportunities for self-study, and Haskell in particular seems to have inspired a rich, but generally accessible, technical literature, along with a large number of web articles, tutorials, etc. And there is always the possibility of using Haskell in your own projects, and gathering expertise thereby. We want to give you a few landmarks to help guide your journey.
Standard Modules
The Haskell Platform ships with a large variety of modules, some part of GHC's core, along with others that have proven useful to the Haskell community generally. Some of these modules provide access to specialized capabilities, e.g., network or database access. But a large number provide quality implementations of more advanced data structures than we've encountered in this course.
Indeed, we've relied very heavily on a few data structures: lists, tuples, a few standard monads, and our own (often naïve) algebraic data types. These have been "good enough" for our immediate pedagogical purposes, which include creating a natural familiarity with these types, but there are often good reasons to consider more sophisticated types, and the modules that support them. To a certain extent, knowing these "good reasons" is why we require CS majors to take a course in Algorithms, which provides means of analysis that aren't accessible to you yet. But you will get to that point, and you will be glad to know that Haskell's libraries provide more suitable data structures.
Data.Sequence
The list type [] is wonderful, but sometimes we want a list-like data structure that supports a larger set of efficient operations, e.g., constant-time access to both ends, or efficient concatenation. If we can accept the restrictions of finiteness and strict underlying operations, Data.Sequence provides this.
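A small illustration (the names ends and joined are invented for the example; the operators are Data.Sequence's own):

import Data.Sequence (Seq, (<|), (|>), (><))
import qualified Data.Sequence as Seq

ends :: Seq Int
ends = (0 <| Seq.fromList [1,2,3]) |> 4    -- add at either end in constant time

joined :: Seq Int
joined = ends >< ends                      -- efficient concatenation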
Data.Set
Sets are the basic abstraction of mathematics, and a useful data abstraction in programming. We have often used lists over types that implement the Eq type class to represent sets, but this can be an inefficient implementation. If our underlying type implements Ord, we can use the Haskell Platform's Data.Set, a balanced binary tree implementation of sets. This module is meant to be imported qualified, as many of its functions clash with Prelude functions.
import Data.Set (Set)
import qualified Data.Set as Set
The basic functions for constructing sets are:
- empty :: Set a, the empty set.
- singleton :: Ord a => a -> Set a, create a one-element set.
- insert :: Ord a => a -> Set a -> Set a, create a new set by inserting an element into an existing set.
- delete :: Ord a => a -> Set a -> Set a, create a new set by deleting an element from an existing set.
- fromList :: Ord a => [a] -> Set a, create a set from a list of elements.
Where the power of sets really comes into play is operations that combine sets, e.g.,
- union :: Ord a => Set a -> Set a -> Set a, create the union of two sets.
- intersection :: Ord a => Set a -> Set a -> Set a, create the intersection of two sets.
- difference :: Ord a => Set a -> Set a -> Set a, create the difference of two sets. The (\\) operator is an alias for difference.
There are a large number of other operations, e.g., member for element testing, findMin and deleteMin, analogous functions for max, null, which tests a set for emptiness, and size, which computes the number of elements in a set. A general rule of thumb: before writing a new Set function, check to make sure it hasn't already been implemented in Data.Set.
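Putting a few of these together (the sets here are invented for illustration):

import Data.Set (Set)
import qualified Data.Set as Set

evens, smalls :: Set Int
evens  = Set.fromList [0,2..10]          -- fromList [0,2,4,6,8,10]
smalls = Set.fromList [0..5]             -- fromList [0,1,2,3,4,5]

both :: Set Int
both = evens `Set.intersection` smalls   -- fromList [0,2,4]

hasFour :: Bool
hasFour = Set.member 4 both              -- True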
One of the disappointments with the Set type is that Set is not a functor or a monad, nor can it be without considerable deviousness, because of the constraint that the elements of a Set must be ordered. This is something that people are thinking about.
Data.Map
The Map type can be thought of as an association list with more efficient operations, but the implementation is basically that of a Set in which values are carried along with the keys. Naturally, this gives rise to the Ord type class constraint on the keys of a Map.
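A minimal sketch (the data is invented; Data.Map, like Data.Set, is meant to be imported qualified):

import Data.Map (Map)
import qualified Data.Map as Map

phoneBook :: Map String Int
phoneBook = Map.fromList [("alice", 1234), ("bob", 5678)]

bob :: Maybe Int
bob = Map.lookup "bob" phoneBook   -- Just 5678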
Data.IntSet, Data.IntMap
There are specialized versions of the Set and Map types for use in the common case where the elements/keys are of type Int. This allows for a specialized implementation (a trie over the bits of the key), with substantially better lookup efficiency.
Data.Array
Surprisingly, Haskell does support an array type. But remembering that Haskell is a pure language, there's an obvious issue associated with assignment: somehow we have to end up with both the old array and the new one. The effect of this is that Haskell has an array type that supports very efficient build and lookup, but the cost of an assignment is the same as the cost of a build, i.e., linear in the array size. This means that naïve translations of array-based algorithms into Haskell can be hopelessly inefficient. But many array-based algorithms can be re-expressed as iterative builds, i.e., as a sequence of steps in which a single new array is built based on values from the old array. In such cases, the algorithms can be very efficient.
The basic array type is Array i e, where i is the index type, which must belong to the typeclass Ix, and is often Int. Note, though, that Ix is closed under tuples, e.g., there is an instance (Ix a, Ix b) => Ix (a,b), enabling the use of multidimensional arrays.
Note that, unlike in most other programming languages, the bounds of an array are not a part of its type.
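For instance, here is a sketch of a two-dimensional array built via the tuple instance (the bounds and contents are invented for illustration):

import Data.Array

grid :: Array (Int, Int) Char
grid = array ((0,0),(1,2)) [((r,c), '.') | r <- [0..1], c <- [0..2]]
-- grid ! (1,2) == '.'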
One of the places where Haskell arrays really shine is in implementing dynamic programs -- these are programs in which solutions to sub-problems are memoized, avoiding unnecessary recalculation. Consider the problem of counting the number of distinct binary trees with n nodes. This can be attacked directly via a simple recursive program:
countTrees :: Integer -> Integer
countTrees n =
    if n == 0
        then 1
        else sum [ countTrees left * countTrees right
                 | left <- [0..n-1]
                 , let right = n - left - 1
                 ]
There is only one tree with zero nodes, the empty tree. For non-empty trees, after fixing the top node, we consider each of the ways of dividing the remaining nodes between the left and right subtrees, and count each way of combining a distinct left subtree with a distinct right subtree.
The code here is conceptually quite clear, but unfortunately, it's not fast: computing countTrees 15 takes a few seconds on a modern machine, but countTrees 30 would require more than a century. Can we do better?
Our approach will be to fill an Array with values of the function, while using those same values to facilitate the calculation:
import Data.Array

countTreesFast :: Integer -> Integer
countTreesFast n = a ! n where
    a = array (0,n) [(i, ct i) | i <- [0..n]]
    ct n = if n == 0
           then 1
           else sum [ a ! left * a ! right
                    | left <- [0..n-1]
                    , let right = n - left - 1
                    ]
Let's take this apart:
- We build an Array using array :: (Ix i) => (i,i) -> [(i,e)] -> Array i e. The first argument (i,i) is the bounds of the array, inclusive. The second argument [(i,e)] is a list of associations. The associations don't need to be in ascending order, they don't need to be complete (i.e., some indices may be missing, in which case the corresponding element is undefined), nor consistent (i.e., the same index may appear multiple times with distinct values, in which case the actual association is implementation-dependent). In this case, we've defined each value precisely once.
- Note that the local ct function has a structure very similar to that of our original countTrees function, except that array lookups replace recursive calls. It is possible to pursue this more systematically, but we won't now.
- The use of a here depends heavily on the fact that we're working with a lazy language. The initial definition associates (evaluated) keys with unevaluated thunks. It is the lookup at the end (a ! n) that drives the evaluation of these thunks.
The performance characteristics of countTreesFast are very different from those of countTrees. We can compute
> countTreesFast 100
896519947090131496687170070074100632420837521538745909320
almost instantaneously, and
> countTreesFast 1000
in a few seconds, obtaining a 598-digit answer.
Data.ByteString
String is wonderful, but it comes with very large space overheads and associated inefficiencies, especially if we're dealing with strings of ASCII characters on 64-bit machines. A ByteString is a packed array of Word8's, i.e., raw bytes, suitable for high performance and large quantities of data. I've used ByteString to handle bioinformatics data sets containing hundreds of megabytes of ASCII data. You wouldn't want to do this with ordinary Haskell strings, which take 20(!) bytes of memory to represent a single byte of data. There are basically two costs associated with ByteString: first, cons is expensive, as it involves recopying the array; second, non-ASCII characters cannot easily be represented. The default Data.ByteString module adds the further complication that the values of the array are not typed as characters. This often makes Data.ByteString.Char8 more convenient in practice.
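A minimal sketch of the Char8 interface (the file name is invented):

import qualified Data.ByteString.Char8 as BS

main :: IO ()
main = do
    contents <- BS.readFile "genome.fa"   -- read the whole file as raw bytes
    print (BS.count 'A' contents)         -- count occurrences of the byte 'A'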
Data.Text
Data.Text provides a more efficient representation of Unicode text than String, albeit at the cost of requiring you to learn a new interface.
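A minimal sketch (like ByteString, Text is normally imported qualified):

import qualified Data.Text as T
import qualified Data.Text.IO as TIO

greeting :: T.Text
greeting = T.toUpper (T.pack "hello, world")

main :: IO ()
main = TIO.putStrLn greeting   -- prints HELLO, WORLD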
Additional Language Features
We've presented Haskell as a pure, lazy, functional programming language. In fact, it is neither perfectly pure nor perfectly lazy. It's just that purity and laziness are the defaults.
seq and friends
There's a primitive Haskell function, seq :: a -> b -> b, which can be used to force the partial evaluation of a term into so-called "weak head normal form," i.e., sufficiently defined that the least non-trivial pattern match can succeed. Using seq provides very fine-grained control over when and how evaluation takes place, and there are sometimes good reasons for wanting to exercise this level of control.
If you have code that you think should be running in constant space, but isn't, it's often because laziness is creating thunks (unreduced expressions) where you're not expecting them. In such cases, adding a seq can force an evaluation, and eliminate the space leak.
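For instance, here is a sketch of the classic example (the function names are mine, not from any library): summing with an accumulator leaks space because the accumulator piles up as an unevaluated chain of additions; forcing it with seq restores constant space.

sumTo :: Integer -> Integer
sumTo n = go 0 1
  where
    go acc i
        | i > n     = acc
        | otherwise = acc `seq` go (acc + i) (i + 1)   -- seq forces acc at each step

Without the seq, each recursive call would leave acc as a growing thunk 0 + 1 + 2 + ..., collapsed only when the final result is demanded.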
There is a trap here. Programmers with an inaccurate understanding of Haskell's evaluation model often find it difficult to understand how seq helps, but know (if only by anecdote) that sometimes it does. So they start to "sprinkle" their code with seq's, in the hope that something good will happen. And sometimes (often enough to reward bad behavior), it does. The problem here is that unnecessary seq's can actually make your code slower and/or less space efficient. It's not that lazy vs. eager is a choice between good and bad, or bad and good if you prefer, but that each context needs to be understood before any reliable judgment can be made.
It is somewhat easier to use bang-patterns, i.e., strictness annotations, on either data (which doesn't require a language extension) or on functions (which does). They're worth keeping in mind if you need them. Seek, and you will find.
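A sketch of both forms (the type and function here are invented for illustration):

{-# LANGUAGE BangPatterns #-}   -- needed only for the bangs on function arguments

-- Strict fields: no extension required.
data Point = Point !Double !Double

-- Bang patterns keep both accumulators evaluated at each step.
mean :: [Double] -> Double
mean = go 0 0
  where
    go !s !n []     = s / fromIntegral n   -- note: mean [] yields NaN
    go !s !n (x:xs) = go (s + x) (n + 1) xs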
Control.Monad.ST
It's a bit confusing, if you dig through the Haskell Platform documentation, to encounter what appear to be two distinct state monads: Control.Monad.State and Control.Monad.ST. The difference is that the former provides the illusion of state, implemented by pure means, whereas the latter gives you honest-to-goodness assignment and mutability. It is indeed possible to write Fortran programs in Haskell.
Control.Monad.ST isn't for the faint of heart. It's verbose, and more than a bit dangerous. But the essential genius of Control.Monad.ST is that it enables you to implement "pure" functions by impure means, and there are sometimes (less often than you'd think, but not never) opportunities to write far more efficient code this way. In particular, Data.Array's accumArray function (a kind of index-based foldr for arrays) has such an implementation, and accumArray is a very useful function to know about.
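Here is a minimal sketch of the idea: a "pure" function implemented with genuine mutation, safely encapsulated by runST (the function name is mine):

import Control.Monad.ST
import Data.STRef

sumST :: [Int] -> Int
sumST xs = runST $ do
    ref <- newSTRef 0                        -- a genuinely mutable cell
    mapM_ (\x -> modifySTRef' ref (+ x)) xs  -- accumulate by mutation
    readSTRef ref                            -- the result escapes; the mutation doesn't

And accumArray in action, building a histogram in a single pass (example data invented; it assumes all values lie in the bounds):

import Data.Array

histogram :: [Int] -> Array Int Int
histogram xs = accumArray (+) 0 (0,9) [(x, 1) | x <- xs]
-- histogram [1,1,3] ! 1 == 2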
The Rest of the Typeclassopedia
We've encountered a large number of algebraically- and category-theoretically inspired type classes this quarter, but we've hardly exhausted the topic. The next type class for you to learn is Traversable, after which it's useful to turn your attention to Category and Arrow. The Arrow type class is heavily used in various libraries, notably the HXT (Haskell XML Toolkit) modules.
Parallel and Concurrent Haskell
One of the nicest parts of a pure language is that it is relatively easy for the language compiler to support parallelism, as purity entails a lack of dependence on order of operations. Haskell was written in such a way as to support parallel and concurrent programming, and there are nice libraries (and an eponymous and excellent O'Reilly book by Simon Marlow) that support it.