Supplemental Lecture 16a: Scalability

\( \newcommand{\length}[1]{\vert#1\vert} \)

Programming requires attention to correctness and efficiency. The dominant focus of this class has been on correctness, and on using the type system of Haskell and the type classes of its library as aids to writing correct code. This reflects our ontology of programming: it's easier and more satisfactory to tune correct code so that it runs faster, than it is to debug fast code so that it runs correctly. Getting it right, in our view, is a constraint, it's something we must do. Making it fast is a goal, something we hope to do, and want to make progress towards.

In this, the Document type from our second State lecture fails the test of efficiency. It is a common failing, and one that's worth understanding. We build our document from front to back, in effect, building

(... ((s1 ++ s2) ++ s3) ... ++ sn)

Let's recall the implementation of (++):

(++) :: [a] -> [a] -> [a] (++) [] ys = ys (++) (x:xs) ys = x : xs ++ ys

It is easy to show, by induction on the length of xs, that fully evaluating x ++ y requires $\length{x}$ new (:)'s. The cost of building these (:)'s is the dominant cost of evaluating the (++).

Consider now the overall cost of building a string via (++). For the sake of simplicity, we'll consider (((s1 ++ s2) ++ s3) ++ s4).

s1 ++ s2$\length{s_1}$
(s1 ++ s2) ++ s3$\length{s_1} + \length{s_2}$
(((s1 ++ s2) ++ s3) ++ s4)$\length{s_1} + \length{s_2} +\length{s_3}$

So the overall cost is the sum of these operation costs, i.e., $3 \length{s_1} + 2 \length{s_2} + \length{s_3}$. Contrast this to the cost of building the same string via the same operations, but with the (++)'s associated to the right, i.e, s1 ++ (s2 ++ (s3 ++ s4)):

s3 ++ s4$\length{s_3}$
s2 ++ (s3 ++ s4)$\length{s_2}$
s1 ++ (s2 ++ (s3 ++ s4))$\length{s_1}$

So the overall cost of building the string this way is $\length{s_1} + \length{s_2} + \length{s_3}$, because each (:) we build contributes to the final answer, whereas we build a lot of ephemeral (:)'s when the (++)'s associate to the left.

This can make a difference! Building a moderately complex web page might involve thousands of calls to string, meaning that we end up re-copying the (:)'s associated with early calls to string thousands of times.

Fortunately, we can address this problem rewriting only a little bit of code, as there's an abstraction barrier between Document and HTML, so we need only consider the definition of Document, and make sure that the string and render functions are aligned with it. We will explore several different solutions. For the sake of simplicity, we won't do our usual code transformations, but will just present natural first-pass code.

Let Simon Do It

An efficient solution to problems like this is to rely on the fact that others have encountered them before, resulting in carefully engineered, widely-deployed, and so well-tested solutions. In this case, it's well known that String is inefficient, not only in the cost of operations like (++), but also in terms of storage space. The module Data.Text uses a packed utf-16 representation of Unicode, while the Data.Text.Lazy module adds to this by considering chunked representations (i.e., the text is a sequence of packed utf-16 representations). We can then define:

import Data.Text.Lazy type Document = State Text {- Append a String to a Document. -} string :: String -> Document () string t = modify $ \s -> append s (pack t) {- Render a Document a as a String -} render :: Document a -> String render doc = unpack $ execState doc empty

Note that the pack and unpack functions convert between String and Text values, and that append is used to append Text values.

It turns out that there are a fair number of alternative representations for what we think of as list-like data, including Data.Array which allows efficient access to lists of fixed length, and Data.Sequence which allows for efficient access to both the front and back of a list. The existence of these alternative representations explains the existence of many of the standard type classes, e.g., Functor, Foldable, etc., all of which capture standard list-programming techniques. To the extend that we write our list-based code using the functionality of these type classes, we abstract away from specific representation decisions, making it easier to transition from simpler to more complex (and efficient) data representations later on.

The Composition Trick, a.k.a., Difference Lists

This is a technique that is used in several places in the Haskell library. We build up the efficient right-associative use of (++) via a composition of functions that associates to the left. There's something deeply sneaky about this, as we're trading off adding content to beginning, for a promise to content to the end!

type Document = State (String -> String)

The idea is that the function we're building tacks the String we've built onto its argument.

{- Append a String to a Document. -} string :: String -> Document () string t = modify $ \f -> f . (t ++) {- Render a Document a as a String -} render :: Document a -> String render doc = execState doc id ""

Note in particular our implementation of render. We start the ball rolling by passing the identity function to execState, which represents an empty document. When we get the final function out, we apply it to the empty string, as there is no "rest of the string" to build.

Be Lazier!

The idea here is to defer the (++)'s to the end. So we build a list of strings to combine, but we organize that list from back-to-front, i.e., reversed in order from the document we intend to produce. This means that the actual joining together of the lists happens with render, rather than with string:

type Document = State [String] {- Append a String to a Document. -} string :: String -> Document () string t = modify $ \s -> t : s {- Render a Document a as a String -} render :: Document a -> String render doc = concat . reverse . execState doc $ []


All of these changes will turn a program that had quadratic run-time to one that has linear run-time. The tuned programs scale. But what is perhaps most remarkable about this exercise is that in each case, the new code ran correctly the first time! This had much to do with the work we put into our initial list-based solution, and in particular, the simplicity of the Document abstraction, and how thoroughly the rest of the code was decoupled from the particular representation choice we made within Document.

Programs work at different scales. Inefficiencies can creep in at any of these scales, e.g., in the algorithms used, in the data representations used, in the efficiency with which those representations are implemented. Traditional languages try to snuggle up to the underlying hardware (this is especially so in C), and naturally draw the programmer's attention to improving the efficiency of the chosen implementation, which is not a bad thing, except to the extent that it draws their attention away from larger inefficiencies in the algorithms and representation choices, and thereby creates an investment in the higher-level choices that were made. Haskell seems in practice to encourage more of a top-down approach, and this tends to deliver greater fruit faster.