Lecture 22: Seq and All That

Evaluation: the Term Model.

Haskell is defined in terms of lazy evaluation, which we can describe briefly as a computational plan that evaluates only what is necessary, and when it evaluates something, it evaluates it only once. For the most part, we can get away with this sort of informal description, but there comes a point when we can't, and need a better approximation.
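
To make those two promises concrete, here's a tiny sketch (the function f is just an illustrative stand-in for something costly, not part of the lecture's running example):

f :: Int -> Int
f n = n * n   -- stand-in for something expensive

main :: IO ()
main = do
    print (take 3 [1 ..])          -- "only what is necessary": just the first
                                   -- three elements of the infinite list are built
    print (let x = f 10 in x + x)  -- "only once": f 10 is evaluated a single
                                   -- time, and both uses of x share the result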

Let's start by considering a simple function:

mySumR :: [Int] -> Int
mySumR [] = 0
mySumR (n:ns) = n + mySumR ns

This is a definition of summation via a natural recursion, with the terminal R in mySumR intended to recall foldr. This code works well, at least until we try to sum a long list:

main = print $ mySumR [1..10^8]

This computes correctly, but it takes a surprisingly long time to do so (a bit over 4.5 seconds, even at -O2). Profiling this code produces somewhat surprising and disconcerting output:

  18,495,737,248 bytes allocated in the heap
       1,390,456 bytes copied during GC
   1,674,375,728 bytes maximum residency (13 sample(s))
          31,592 bytes maximum slop
            2458 MB total memory in use (38 MB lost due to fragmentation)

The thing to notice here is the memory footprint: at some point in the computation, we were using 1.67GB of heap, and the total memory footprint reached 2.46GB. That's a whole heck of a lot of memory for what seems like a fairly simple task. A few moments of introspection reveal a possible cause: as we work our way through the list, we build an arithmetic expression to evaluate. But that expression grows to be essentially the size of the list we're processing, and we can't make progress on reducing that expression until we've built all of it. Ugly, especially as we know that traditional languages can do this in constant space, cf.,

def sum(ns):
    acc = 0
    for n in ns:
        acc += n
    return acc

print(sum(range(1,10**8+1)))

and we'd like to think that Haskell could too.

A simple analysis of this program shows that sum accumulates the sum from the front of the list to the back, unlike mySumR, which built up an expression for the sum from back to front. Indeed, the iterative structure uses a register variable to hold a partially evaluated sum, rather than building up the stack of deferred function calls implied by a recursive function. There are at this point two fairly natural ways to think about this. Former Scheme programmers would think about tail recursion and register variables, note that mySumR isn't tail recursive, and so would look for a tail-recursive analog to the Python program. Cynical students, i.e., students who know me, might read something into my emphasis on the foldr-like structure of mySumR, and wonder about what a foldl solution might look like. Either way, we end up with essentially the same code:

mySumL :: [Int] -> Int
mySumL = iter 0 where
    iter r [] = r
    iter r (n:ns) = iter (r+n) ns
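
For the record, the foldl phrasing alluded to above amounts to the same computation (a small sketch; mySumL' is just an illustrative name, not one of the versions profiled below):

mySumL' :: [Int] -> Int
mySumL' = foldl (+) 0   -- the accumulator plays the role of iter's r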

A few moments of reflection around the notion of laziness, and we might even hope that our original program, updated to use mySumL rather than mySumR, might run in constant space. After all, if the list is being built up lazily and consumed as it's being built, there will never be more than a single (:) active at a time. We're surprised and, indeed, horrified, to discover that this hopeful change makes our code perform much worse, requiring almost a half-minute of run time and

  26,620,313,904 bytes allocated in the heap
  38,121,812,496 bytes copied during GC
   7,934,690,960 bytes maximum residency (16 sample(s))
     190,450,032 bytes maximum slop
           17973 MB total memory in use (0 MB lost due to fragmentation)

Ouch!! That's 7.94GB of heap, and 26.6GB of total memory allocation! Don't try this on your laptop!

Anyway, let's try to understand what's going on, by "playing the computer," and working our way through some modest sized examples (summing [1..4]). We'll use a simple term re-writing model, in which both the traditional eager evaluation and Haskell's lazy evaluation share the same re-write rules, but apply them in a different order. A preliminary issue is that Haskell's range notation [a..b] is syntactic sugar for enumFromTo a b, which we'll assume for our present purposes is defined as:

enumFromTo a b = case compare a b of
    LT -> a : enumFromTo (succ a) b
    EQ -> [b]
    GT -> []

If we were using the usual eager model of calculation, we'd have a pattern in which we first fully evaluate the argument to mySumR:

mySumR [1..4]
  ==> mySumR (enumFromTo 1 4)        -- desugar [..]
  ==> mySumR (1:enumFromTo 2 4)      -- expand enumFromTo
  ==> mySumR (1:2:enumFromTo 3 4)    -- expand enumFromTo
  ==> mySumR (1:2:3:enumFromTo 4 4)  -- expand enumFromTo
  ==> mySumR (1:2:3:4:[])            -- expand enumFromTo

Note here that we're eliding a few steps, e.g., around the evaluation of enumFromTo in order to focus on the evaluation of mySumR.

Once the argument to mySumR is evaluated, we can replace the application by its definition, i.e.,

  ==> 1 + mySumR (2:3:4:[])             -- expand mySumR
  ==> 1 + (2 + mySumR (3:4:[]))         -- expand mySumR
  ==> 1 + (2 + (3 + mySumR (4:[])))     -- expand mySumR
  ==> 1 + (2 + (3 + (4 + mySumR [])))   -- expand mySumR
  ==> 1 + (2 + (3 + (4 + 0)))           -- expand mySumR

Finally, we'd simplify the resulting expression:

1 + (2 + (3 + (4 + 0)))
  ==> 1 + (2 + (3 + 4))
  ==> 1 + (2 + 7)
  ==> 1 + 9
  ==> 10

You can clearly see in this process the building up of the complete list [1,2,3,4], and the building up of the full arithmetic expression 1 + (2 + (3 + (4 + 0))). Now Haskell doesn't work quite this way. Haskell is lazy. So let's work through a lazy evaluation. It starts out much the same way:

mySumR [1..4]
  ==> mySumR (enumFromTo 1 4)    -- desugar [..]
  ==> mySumR (1:enumFromTo 2 4)  -- expand enumFromTo

At this point, the argument to mySumR is sufficiently defined (it's a cons) to determine that we're in the cons case of the definition of mySumR:

  ==> 1 + mySumR (enumFromTo 2 4)  -- expand mySumR

At this point we can't make progress on the addition, because its second argument isn't sufficiently well-defined, and we can't make progress on mySumR, because its argument isn't sufficiently well-defined, but we can make progress on enumFromTo, concluding as follows:

  ==> 1 + (mySumR (2 : enumFromTo 3 4))        -- expand enumFromTo
  ==> 1 + (2 + mySumR (enumFromTo 3 4))        -- expand mySumR
  ==> 1 + (2 + mySumR (3 : enumFromTo 4 4))    -- expand enumFromTo
  ==> 1 + (2 + (3 + mySumR (enumFromTo 4 4)))  -- expand mySumR
  ==> 1 + (2 + (3 + mySumR (4:[])))            -- expand enumFromTo
  ==> 1 + (2 + (3 + (4 + mySumR [])))          -- expand mySumR
  ==> 1 + (2 + (3 + (4 + 0)))                  -- expand mySumR
  ...
  ==> 10

Notice that, as our intuition earlier suggested, there's only a single (:) present in any of the terms of the reduction sequence. This means that we can garbage collect the list more-or-less as we construct it, and the list contributes only a constant amount to our maximum active memory. But we're not so fortunate with the arithmetic expression. Too bad.

One thing to note here is that the results of eager and lazy evaluation are the same. This isn't an accident, but instead is a consequence of the Church-Rosser Theorem of the λ-calculus, which states that any two sequences of reductions to a term $t$ result in terms $t_1$ and $t_2$ that can be "brought back together" by additional reductions. This is sometimes called the confluence property. As normal forms can't be further reduced, this means that if two normal forms result from sequences of reductions applied to the same initial lambda-term, they must be equal. We'll have a bit more to say about this later.
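
Stated a bit more precisely: if a term $t$ reduces (in zero or more steps) to terms $t_1$ and $t_2$, then there is some term $t_3$ to which both $t_1$ and $t_2$ reduce. If $t_1$ and $t_2$ are already normal forms, no further reductions apply, so they must both equal $t_3$.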

So let's consider mySumL. Again, we'll first do this via eager evaluation, which begins much like the mySumR case by expanding the full list:

mySumL [1..4]
  ==> mySumL (enumFromTo 1 4)        -- desugar [..]
  ==> mySumL (1:enumFromTo 2 4)      -- expand enumFromTo
  ==> mySumL (1:2:enumFromTo 3 4)    -- expand enumFromTo
  ==> mySumL (1:2:3:enumFromTo 4 4)  -- expand enumFromTo
  ==> mySumL (1:2:3:4:[])            -- expand enumFromTo

At this point, we expand mySumL in terms of its local iter procedure, which we'll indicate via mySumL.iter:

  ==> mySumL.iter 0 (1:2:3:4:[])    -- expand mySumL
  ==> mySumL.iter (0+1) (2:3:4:[])  -- expand mySumL.iter

Now, eager evaluation steps in at this point, and forces the full evaluation of the first argument to mySumL.iter:

  ==> mySumL.iter 1 (2:3:4:[])      -- simplify addition

The evaluation continues:

  ==> mySumL.iter (1+2) (3:4:[])    -- expand mySumL.iter
  ==> mySumL.iter 3 (3:4:[])        -- simplify addition
  ==> mySumL.iter (3+3) (4:[])      -- expand mySumL.iter
  ==> mySumL.iter 6 (4:[])          -- simplify addition
  ==> mySumL.iter (6+4) []          -- expand mySumL.iter
  ==> mySumL.iter 10 []             -- simplify addition
  ==> 10                            -- expand mySumL.iter via the [] case

The thing to note here is that, even though eager evaluation forced us to build the full list, we never had more than a single pending addition. Eager evaluation kept the accumulator fully evaluated, so we spent no space on a growing chain of deferred additions.

Let's consider lazy evaluation of the same expression:

mySumL [1..4]
  ==> mySumL (enumFromTo 1 4)    -- desugar [..]

At this point, we don't have to wait for any evaluation of the argument to mySumL, so we expand immediately:

  ==> mySumL.iter 0 (enumFromTo 1 4)    -- expand mySumL

To expand mySumL.iter further, we need the second argument to be more fully defined, so our evaluation continues there:

  ==> mySumL.iter 0 (1:enumFromTo 2 4)  -- expand enumFromTo

And now the second argument is sufficiently well-defined to permit the expansion of mySumL.iter:

  ==> mySumL.iter (0+1) (enumFromTo 2 4)  -- expand mySumL.iter

At this point, we'd like to continue (as we did in the eager case) with the evaluation of the sum, but mySumL.iter doesn't depend on the value of its first argument to determine which case to apply, so it invests its evaluation effort on the second argument. This is a very important point, and I want you to think about it until you understand why Haskell makes the choice it's making. The evaluation continues:

  ==> mySumL.iter (0+1) (2:enumFromTo 3 4)
  ==> mySumL.iter ((0+1)+2) (enumFromTo 3 4)
  ==> mySumL.iter ((0+1)+2) (3:enumFromTo 4 4)
  ==> mySumL.iter (((0+1)+2)+3) (enumFromTo 4 4)
  ==> mySumL.iter (((0+1)+2)+3) (4:[])
  ==> mySumL.iter ((((0+1)+2)+3)+4) []
  ==> ((((0+1)+2)+3)+4)

Exercise 22.1 Annotate the reduction steps above in the style of the foregoing discussion.

The evaluation now amounts to simplifying an algebraic expression. But the important thing to note here is that our lazy evaluation commits the complementary sin to eager evaluation: it doesn't require that the whole list ever be resident in memory, but it does require that the whole algebraic expression be resident in memory.

It seems that our search for a constant-space summation fails for one reason with eager evaluation, and for another with lazy evaluation. We can trace out an evaluation process that would do this, but can we make it happen?

Exercise 22.2* Produce an evaluation sequence in the style of above, which never involves more than a single cons or addition.

Evaluation strategy vs. strictness

Before getting on with directing evaluation strategies, let's talk briefly about strictness. The Church-Rosser Theorem entails that two reduction sequences that begin with the same term, and end with normal forms, must end with the same normal form. But this does not say that if one evaluation strategy results in a normal form, then all evaluation strategies result in a normal form. Here's a simple example. Consider the infinite loop function:

loop :: a
loop = loop

and the evaluation of const 0 loop. If we use eager evaluation, we'll attempt to evaluate the loop argument, resulting in a high speed twiddling of our thumbs:

const 0 loop
  ==> const 0 loop   -- expand loop
  ==> const 0 loop   -- expand loop
  ==> const 0 loop   -- expand loop
  ...

But if we use normal order, expanding the definition of const before attempting to evaluate loop:

const 0 loop
  ==> 0   -- expand const, done

We'll say that a function f is strict in one of its arguments with respect to an evaluation strategy if the divergence of that argument forces the divergence of the function when evaluated using that strategy. Since languages usually define evaluation strategies, we'll often speak of the strictness (or lack thereof) of a function without explicit reference to that language's defined evaluation strategy.
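
In denotational terms, this is often stated without reference to a strategy at all: writing $\bot$ for a diverging computation, a function $f$ is strict in its argument exactly when $f\,\bot = \bot$.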

Eager evaluation (sometimes called call-by-value) is the strictest evaluation strategy: eager evaluation evaluates every subterm of a term, hence, if any sub-term diverges, so too does the top-level evaluation. Some consider this to be a bug, others consider it to be a feature. Lazy evaluation comes with the opposite promise: if any evaluation strategy succeeds in reducing a term to normal form, it will succeed. Again, some consider this to be a bug, others consider it to be a feature. But the basic value proposition of normal-order evaluation is that it will evaluate every sub-expression only as often as it needs to, which may well be a lot of times, but it may also be zero times.

GHC and Lazy Evaluation

At this point, I feel a slight need to come clean. If you've been following along with ghc, you might have noticed that it's perfectly happy and reasonably efficient in evaluating mySumL [1..10^8], even on laptops with modest memory. I'm tempted to ask, what do you believe? Your humble instructor who would never knowingly lead you astray, or your own lying eyes? But I probably wouldn't be happy with the answer.

The truth here involves a great, and even surprising subtlety. Haskell does not require that Haskell compilers/interpreters use lazy evaluation. It just requires that they'll always produce as well-defined a result as lazy evaluation would. Now the people who write ghc are very clever, and examples like the one above are embarrassing. So ghc does a strictness analysis of the code it compiles, even at normal optimization levels. I actually had to explicitly turn off strictness analysis to get the results above. This strictness analysis enables ghc to find the "golden path" evaluation that runs in constant space. This might lead you to conclude, "I'll just use ghc and not worry." While using ghc is a good idea, the problem with this is that the strictness analyzer isn't omnipotent, and exactly the issue described here can and does happen with real-world code. So we've wasted no effort in this discussion, we've just employed a pedagogical example that's handled differently (i.e., better) in practice.

Graph Reduction

Our discussion of evaluation strategies assumed a term evaluation model. A fuller understanding of lazy evaluation (vs. the lambda-calculus's normal-order evaluation) requires a more sophisticated mental model of graph reduction. We'll give a brief (and as usual, incomplete) account of that, before returning to the problem of a constant space summing.

Let's consider a very simple program:

square x = x * x

main = print $ square (3+3)

Again, the print function will drive the evaluation of square (3+3). We'll start with a graph consisting of an application of square to a single, as yet unevaluated, node representing 3+3.

Our lazy evaluation begins by replacing the square node with a multiplication node, in which the argument to square ends up as both arguments to the resulting multiplication: not as two copies, but as two pointers to a single shared node.

The next step of the evaluation follows the first argument of (*) to the (+) node, and since both of its arguments are already reduced, that node is reduced to 6.

It is crucially important to understand that, because of the sharing involved, reducing 3+3 to 6 in the first argument to (*) also reduces it in the second. Thus, when the evaluation of the arguments to (*) moves to the second argument, the evaluator finds that node already evaluated, and needs to do no more work there. Instead, the top-level (*) is ready to be evaluated, and its node is replaced by 36.
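
One way to observe this sharing directly is with Debug.Trace (a small sketch, not part of the lecture's example): if 3+3 were reduced twice, the trace message would print twice.

import Debug.Trace (trace)

square :: Int -> Int
square x = x * x

main :: IO ()
main = print $ square (trace "reducing 3+3" (3 + 3))
-- prints "reducing 3+3" exactly once (to stderr), then 36: both occurrences
-- of x in square's body point at the same shared node, which is reduced once.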

This brief discussion hints at the complexity involved. For a slightly more complete discussion, consult HaskellWiki: Graph Reduction.

Returning to the main problem

Let's recall the space usage issue from the beginning of the lecture. We want to sum [1..10^8], but without incurring excessive space use. We know that there is a reduction sequence that requires only constant space based on the definition of mySumL, but that both eager and lazy evaluation result in the use of huge amounts of memory.

An ad hoc solution.

We'll start by revisiting the lazy evaluation of mySumL [1..4], and stop at the moment the train left the tracks:

mySumL [1..4]
  ==> mySumL (enumFromTo 1 4)             -- desugar [..]
  ==> mySumL.iter 0 (enumFromTo 1 4)      -- expand mySumL
  ==> mySumL.iter 0 (1:enumFromTo 2 4)    -- expand enumFromTo
  ==> mySumL.iter (0+1) (enumFromTo 2 4)  -- expand mySumL.iter

At this point, the next reduction step focusses on the second argument to mySumL.iter, because it doesn't depend on the first. What if, what if, what if, ... we could somehow make it depend on that first argument?

Consider the following, rather odd code:

mySumWithForce :: [Int] -> Int
mySumWithForce = iter 0 where
    iter r [] = r
    iter r (n:ns) | force r = iter (r+n) ns
    force n = n /= 0 || n == 0

The idea here is that the force function will always return True, but even the mighty ghc can't figure this out. So it has to run the test, and thereby evaluate r, before it can expand mySumWithForce.iter. This keeps the accumulator argument small (indeed, for an Int, fully reduced), and allows the code to run in constant space. Sure enough, that's what the profile says:

  20,800,063,448 bytes allocated in the heap
       3,735,120 bytes copied during GC
          46,232 bytes maximum residency (2 sample(s))
          31,592 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

The difference between 46K and 7.94GB should definitely get your attention! Factors on the order of 200,000 usually get mine.

Introducing seq

This brings us to one of the more mysterious functions of Haskell, seq.

seq :: a -> b -> b

Informally, seq forces the reduction of its first argument to weak head normal form (i.e., just enough to do the weakest possible non-trivial pattern match) before returning its second. More precisely, evaluation in Haskell is built around forcing a thunk (a delayed evaluation) through just enough steps so that it is no longer simply a thunk, but a non-trivially defined, partially evaluated expression. For algebraic types, this means that the top-level constructor has been determined. For atomic types like Int, it amounts to full evaluation. Function types don't get reduced at all (hence, “weak”).
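
A few small examples may help (a sketch; ex1, ex2, and ex3 are just illustrative names, best tried in ghci):

ex1, ex2, ex3 :: Int
ex1 = (undefined :: Int)      `seq` 0   -- diverges when demanded: reducing an
                                        -- Int to weak head normal form means
                                        -- evaluating it fully
ex2 = Just (undefined :: Int) `seq` 0   -- 0: the outermost constructor (Just)
                                        -- is already weak head normal form
ex3 = (\n -> n + (undefined :: Int)) `seq` 0   -- 0: a lambda is not reduced
                                               -- at all; its body is never examined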

Using seq effectively is a difficult task, and there's many a Haskell programmer with a weak understanding of seq who tries to deal with space leaks by sprinkling seq's throughout their code, and hoping for the best. This is rarely an effective strategy, and it's often counterproductive. Extra strictness is usually going to make your program run slower. But the code we've been working up sets us up for the proper use of seq as a replacement for force:

mySumWithSeq :: [Int] -> Int
mySumWithSeq = iter 0 where
    iter r [] = r
    iter r (n:ns) = let r' = r+n in r' `seq` iter r' ns

This has the effect of reducing the accumulator argument in the call to iter to weak head normal form (i.e., fully evaluating it because it is an atomic type), before we make the call. This use of seq as an infix operator, together with a let binding of the arguments of a function, is fairly common.
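
As an aside, the Prelude's strict application operator ($!) packages exactly this pattern: f $! x evaluates x to weak head normal form before applying f. An equivalent sketch (mySumWithStrictApply is just an illustrative name, not one of the profiled versions) would be:

mySumWithStrictApply :: [Int] -> Int
mySumWithStrictApply = iter 0 where
    iter r []     = r
    iter r (n:ns) = (iter $! (r+n)) ns   -- force the accumulator, then recurse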

Unsurprisingly, mySumWithSeq has constant space use.

  12,800,064,600 bytes allocated in the heap
       1,181,848 bytes copied during GC
          46,232 bytes maximum residency (2 sample(s))
          31,592 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

Bask in the efficiency of a well-managed evaluation process!

Bang!

Although seq is in some sense all you need, in practice many Haskell programmers find it difficult to use. So Haskell provides an alternative: strictness annotations on the components of algebraic data types, which have the effect of making constructors (when used as functions) strict in the annotated components, i.e., those arguments are evaluated to weak head normal form when the constructor is applied. Haskell programmers can then use these strictness-augmented data constructors to tune the use of strictness in their code. Let's consider a simple example:

data SPair a b = SPair !a b

mySumWithSPair :: [Int] -> Int
mySumWithSPair ns = iter (SPair 0 ns) where
    iter (SPair r []) = r
    iter (SPair r (n:ns)) = iter (SPair (r+n) ns)

Note the use of an exclamation point to indicate that the first argument (which we'll use to hold the register in the computation) is strict.

The idea here is that the arguments to mySumWithSPair.iter are marshalled into an SPair value, whose strictness annotation forces the evaluation of the first argument, and thereby avoids building a large deferred calculation. This program results in slightly more than twice the overall allocation of space (those SPairs aren't free), but essentially the same maximum residency as mySumWithSeq.

Bang patterns

The problem with strictness annotations is that they seem to require the definition of a new data type for every function that benefits from additional strictness. GHC is experimenting with an alternative approach: the addition of strictness annotations to the arguments to a function. Consider:

mySumWithBang :: [Int] -> Int
mySumWithBang = iter 0 where
    iter r [] = r
    iter !r (n:ns) = iter (r+n) ns

This code is identical to that of mySumL, save for the exclamation point preceding r in the second clause of the definition of mySumWithBang.iter. The effect of the bang is to require the (partial) evaluation of the corresponding argument before a match can be made. Note that bang patterns are specific to GHC, and are not a feature of the Haskell definition; moreover, their use requires the {-# LANGUAGE BangPatterns #-} pragma, or the equivalent command-line flag during compilation.
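
Putting the pieces together, a minimal self-contained module might look something like this (a sketch, assuming a reasonably recent GHC):

{-# LANGUAGE BangPatterns #-}

mySumWithBang :: [Int] -> Int
mySumWithBang = iter 0 where
    iter r []      = r
    iter !r (n:ns) = iter (r+n) ns   -- the bang keeps the accumulator evaluated

main :: IO ()
main = print $ mySumWithBang [1..10^8]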

Conclusions

The lazy evaluation process of Haskell can sometimes result in shockingly large space use, which in turn can result in poor runtime performance. Identifying the causes of excessive space use isn't easy, and it requires a more complete understanding of Haskell's evaluation process. Titrating a bit of strictness in at the right points can make a huge difference. But strictness isn't a magic bullet, as strictness in the wrong places can actually hurt overall performance.

In terms of running time, mySumWithSeq was by far the fastest, requiring about 1.3 seconds. The other space-efficient programs typically required about twice as much time, and the space-inefficient programs could require up to half a minute, so this matters. Oh, and that Python code? 5.8 seconds.