Lecture 17: Introduction to Functional Parsing

A parser is a function that translates a String representing a value of type s into that value. It's tempting to propose the following as the definition of an abstract Parser type:

type Parser s = String -> s

The problem here is that if we're building a parser out of pieces, the pieces are going to be sub-parsers that consume part but not all of the input. This suggests the following:

type Parser s = String -> (s,String)

where we're returning a pair that consists of the result of parsing an initial segment of the input string, and the unparsed remainder. This may be reminiscent of the State monad, specialized to String. Such a definition would allow us to write code like this:

pairp :: Parser a -> Parser b -> Parser (a,b)
pairp ap bp s = ((a,b),u)
  where
    (a,t) = ap s
    (b,u) = bp t

or even, if we somehow monadify Parser along the lines of State,

pairp :: Parser a -> Parser b -> Parser (a,b)
pairp ap bp = do
  a <- ap
  b <- bp
  pure (a,b)

or, after putting our applicative thinking hats on, even

pairp = liftA2 (,)

both of which hint at things to come. Unfortunately, though, this definition of Parser isn't quite robust enough. Consider, e.g., the special case of trying to parse a simple arithmetic expression. Suppose we had

data Expression
  = Const Double
  | Add Expression Expression
  | Mul Expression Expression

expressionParse :: Parser Expression

What should expressionParse "1+2*3" return? Obviously, we're looking for the result

(Add (Const 1.0) (Mul (Const 2.0) (Const 3.0)),"")

But what about

(Const 1.0,"+2*3")
(Add (Const 1.0) (Const 2.0),"*3")

All are plausible, in that they meet the contract for expressionParse, even though the last one is a bit problematic. We don't just have one right answer; it seems we have three! It's useful to think of this as a non-deterministic calculation. So let's return a list:

type Parser s = String -> [(s,String)]

Of course, we're going to want to make Parser an instance of various standard type classes, so we'll use newtype as before:

newtype Parser s = Parser { runParser :: String -> [(s,String)] }

The result looks like a mash-up of State and the list type []. This is an interesting and productive observation.

Following Jeroen Fokker, who wrote an early and influential account of this style of functional parsing, we now build some simple parsers. For example, satisfy “shifts” the first character of input if it satisfies the argument predicate:

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
  [] -> []
  a:as
    | p a       -> [(a,as)]
    | otherwise -> []

Note here the spiffy use of guards within patterns within a case expression. There's a really cute way to clean up that last little bit, which is called the Fokker trick:

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
  []   -> []
  a:as -> [(a,as) | p a]

How does this work? A list comprehension produces the empty list whenever one of its guards fails, so [(a,as) | p a] is either [(a,as)] or [], according to whether p a is True or False. It's a cute trick, and well worth remembering.
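
To see the trick in isolation, here's a tiny GHCi experiment (nothing parser-specific; the pair is just an arbitrary value):

> [ (1,"abc") | True ]
[(1,"abc")]
> [ (1,"abc") | False ]
[]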

We can use satisfy to define a number of additional simple parsers:

char :: Char -> Parser Char
char c = satisfy (c==)

alpha, digit, space :: Parser Char
alpha = satisfy isAlpha
digit = satisfy isDigit
space = satisfy isSpace
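
For example, given the definitions above, we'd expect:

> runParser digit "123"
[('1',"23")]
> runParser digit "abc"
[]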

*Exercise 17.1 Show by a series of principled transformations that we can define:

char :: Char -> Parser Char
char = satisfy . (==)

The character predicates from Data.Char all beg to be turned into simple parsers, in similar fashion.

Next, we have a simple parser that recognizes a string:

string :: String -> Parser String
string str = Parser $ \s ->
  [(t,u) | let (t,u) = splitAt (length str) s, str == t]
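
For example, we'd expect:

> runParser (string "IntV") "IntV 12"
[("IntV"," 12")]
> runParser (string "IntV") "Int 12"
[]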

Here, again, we use the Fokker trick, creating either a zero- or one-element list. A remarkable fact is that this pretty much concludes our effort to build primitive parsers; further progress will take the form of making Parser an instance of various standard type classes, leveraging the power of these simple functions.

We'll start by considering the task of writing a Parser Bool. We can start by considering a couple of primitive parsers for recognizing the strings "True" and "False".

parseTrue = string "True"
parseFalse = string "False"

An obvious first problem is that these parsers both have type Parser String, rather than Parser Bool. To get a Parser Bool, it's convenient to make Parser an instance of Functor.

instance Functor Parser where
  fmap f p = Parser $ \s -> [(f a,t) | (a,t) <- runParser p s]

With the Functor instance in hand, we can write:

parseTrue = const True <$> string "True"
parseFalse = const False <$> string "False"

and perhaps recognize a useful pattern here:

token :: String -> a -> Parser a
token s a = const a <$> string s

parseTrue, parseFalse :: Parser Bool
parseTrue = token "True" True
parseFalse = token "False" False

With this,

> runParser parseTrue "True"
[(True,"")]

which isn't exactly what we want, but it's a solid step in the right direction.

Of course, we don't want to be able to parse "True" or parse "False", we want to be able to parse a string that contains either. Somehow, we want to combine the two parsers into one parser, merging their outputs. There are a number of ways to approach this. We could make Parser s an instance of Monoid, or we could make Parser an instance of Alternative. Indeed, we could do both. There is an apparent cost associated with making Parser an instance of Alternative, and that is that Alternative requires Applicative. But a moment's consideration should suggest that we're going to want to combine the outputs of multiple parsers, as hinted at with the pairp example, and we know that the easiest way to provide generalizations of fmap to functions of greater arity is through Applicative anyway. So...

instance Applicative Parser where
  pure a = Parser $ \s -> [(a,s)]
  af <*> aa = Parser $ \s ->
    [ (f a,u)
    | (f,t) <- runParser af s
    , (a,u) <- runParser aa t
    ]

instance Alternative Parser where
  empty = Parser $ \s -> []
  p1 <|> p2 = Parser $ \s -> runParser p1 s ++ runParser p2 s
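
As an aside, the Monoid route mentioned above combines parsers in exactly the same way; a minimal sketch, should we want it alongside Alternative:

instance Semigroup (Parser s) where
  p1 <> p2 = Parser $ \s -> runParser p1 s ++ runParser p2 s

instance Monoid (Parser s) where
  mempty = Parser $ const []

We'll stick with Alternative in what follows.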

With the Applicative and Alternative instances in hand, not only does our pairp example work, we can also finish the Parser Bool:

parseBool :: Parser Bool
parseBool = token "True" True <|> token "False" False

An alternative solution, well worth remembering in other contexts, is

parseBool = read <$> (string "True" <|> string "False")
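
And, as promised, the pairp example from the start of the lecture now goes through. Assuming the applicative one-liner pairp = liftA2 (,) from before, we'd expect:

> runParser (pairp digit alpha) "1a rest"
[(('1','a')," rest")]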

At this point, it's standard practice to introduce the parser combinators many and some, which, given a Parser a, return a Parser [a]: many returns a list of zero or more successful parses, and some a list of one or more. It's natural to write these via mutual recursion, as

many, some :: Parser a -> Parser [a]
many p = some p <|> pure []
some p = liftA2 (:) p (many p)

Much to our surprise, attempting to do so results in an error: many and some are already defined in Control.Applicative, with the type signature

many, some :: Alternative f => f a -> f [a]

Moreover, a quick consideration of our implementation reveals that it only uses functions in the Alternative or Applicative type classes. Unsurprisingly, we've just reimplemented these standard functions! Likewise, it's often useful to have an optional parser combinator, which is used for elements that may be present, but it too is already available in Control.Applicative, simplifying our work considerably.
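
For reference, optional has the type

optional :: Alternative f => f a -> f (Maybe a)

so a successful parse shows up wrapped in Just, and Nothing covers the alternative in which the element is not consumed.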

For example:

> runParser (many parseBool) "TrueTrue!"
[([True,True],"!"),([True],"True!"),([],"TrueTrue!")]
> runParser (some parseBool) "TrueTrue!"
[([True,True],"!"),([True],"True!")]
> runParser (optional parseBool) "TrueTrue!"
[(Just True,"True!"),(Nothing,"TrueTrue!")]

We can now consider a mildly non-trivial parsing problem. Consider the type

data IntV = IntV Int deriving (Show)

This is just an Int in a box. Can we parse this given its natural syntax?

parseInt = read <$> some digit
skipSpaces = const () <$> many space
parseIntV = liftA3 (\_ _ i -> IntV i) (string "IntV") skipSpaces parseInt

For example:

> runParser parseIntV "123!"
[]
> runParser parseIntV "IntV 123!"
[(IntV 123,"!"),(IntV 12,"3!"),(IntV 1,"23!")]

A nice consequence of writing parseIntV is that it enables us to add IntV to the Read class, via

instance Read IntV where
  readsPrec _ = runParser parseIntV

Which we can test with

> read "IntV 18" :: IntV IntV 18

Sweet! Of course, we could have also achieved the ability to read IntV values like this by including Read in the deriving clause of the type definition, but explicitly defining the Read IntV instance allows us to choose a different syntax if desired. Note that the wildcard argument to readsPrec is a precedence level, which means that we can have different parsers associated with different parsing contexts, a capability that's not often used.
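
It's worth pausing over why the instance above typechecks at all. The Prelude defines

type ReadS a = String -> [(a,String)]

readsPrec :: Read a => Int -> ReadS a

so, the precedence argument aside, readsPrec expects exactly the String -> [(a,String)] shape that runParser delivers.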

There's a small gap here, in that the parsers we're building return multiple results, whereas read returns only a single result. The bridge is code that filters the result list for pairs (a,t) whose unparsed string t consists only of whitespace, and then requires that there be exactly one result of that form, e.g.,

parseWith :: Parser a -> String -> a
parseWith p s = case [a | (a,t) <- runParser p s, all isSpace t] of
  [a] -> a
  []  -> error "no parse"
  _   -> error "ambiguous parse"

It's reasonable to argue that this might have been better handled by returning a value of type Either String a than by throwing an exception via error. When you write your own language, be sure to do it that way.
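
A minimal sketch of that variant, under the same assumptions as parseWith (the name parseWithEither is ours, not a library function):

parseWithEither :: Parser a -> String -> Either String a
parseWithEither p s = case [a | (a,t) <- runParser p s, all isSpace t] of
  [a] -> Right a
  []  -> Left "no parse"
  _   -> Left "ambiguous parse"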

Monadic Parsing

Historically, functional programmers approached parser combinators through Monad rather than through Applicative and friends. It's worth noting that we can make Parser into a monad, and this has the considerable advantage of making monadic do-notation available to us, although GHC's ApplicativeDo extension permits restricted versions of do with Applicative, largely eliminating this advantage.

instance Monad Parser where
  p >>= g = Parser $ \s ->
    [ (b,u)
    | (a,t) <- runParser p s
    , (b,u) <- runParser (g a) t
    ]

-- mzero and mplus default to empty and (<|>), so no methods are needed.
instance MonadPlus Parser

With these definitions, we could have written

parseIntV :: Parser IntV
parseIntV = do
  string "IntV"
  skipSpaces
  i <- parseInt
  pure $ IntV i

which has the advantage of conceptual simplicity. That said, contemporary practice favors Applicative-based construction over Monad-based construction, and the results speak for themselves in terms of concision, if not always clarity.

*Exercise 17.2 Consider the following type declaration:

data ComplexInt = ComplexInt Int Int deriving (Show)

Use Parser to implement Read ComplexInt, where you can accept either the simple integer syntax "12" for ComplexInt 12 0 or "(1,2)" for ComplexInt 1 2, and illustrate that read works as expected (when its return type is specialized appropriately) on these examples. Don't worry (yet!) about the possibility of minus signs in the specification of natural numbers.

This is cool stuff, and we can and will take this style of parsing very far. This isn't our father's Fortran, nor indeed our high school Java or Scheme.

With all of this work behind us, the end product is a remarkably concise and powerful parser combinator library that requires less than a full page of code. Just to be clear here: we haven't written a parser in less than a full page of code, we've written a module for writing parsers in less than a full page of code. And we're not done yet.

Ἀπὸ μηχανῆς θεός (deus ex machina)

Remember back when we remarked that our original definition of Parser looked like a mash-up of a State monad and a [] monad? We can breathe life into this observation by noting that Control.Monad.State ultimately defines State in terms of the monad transformer StateT. Looking at the definition of StateT, we have

newtype StateT s m a = StateT { runStateT :: s -> m (a,s) }

So, if we define

type Parser = StateT String []

we at least get the type right. But here's where the wild magic begins: if we look at the instances provided by StateT, we'll see:

Functor m => Functor (StateT s m)
(Functor m, Monad m) => Applicative (StateT s m)
(Functor m, MonadPlus m) => Alternative (StateT s m)
Monad m => Monad (StateT s m)
MonadPlus m => MonadPlus (StateT s m)

In the current context, m is [], and we know that [] is a Functor, a Monad, and a MonadPlus, and as such, our Parser type will be an instance of Functor, Applicative, Alternative, Monad, and MonadPlus for free! Moreover, once we have these instances, we can use combinators for these type classes to simplify our base definitions, e.g.,

string :: String -> Parser String
string = mapM char

Pulling this all together, the whole development collapses into a remarkably small module.
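
Here's a sketch of what that module might look like, assuming the StateT-based Parser and the mtl package (the module name and export list are our own choices):

module Parser
  ( Parser
  , runParser
  , satisfy
  , char
  , alpha
  , digit
  , space
  , string
  , token
  , parseWith
  ) where

import Control.Monad.State (StateT(StateT), runStateT)
import Data.Char (isAlpha, isDigit, isSpace)

-- A parser is a stateful, nondeterministic computation whose state is the
-- unconsumed input; Functor, Applicative, Alternative, Monad, and MonadPlus
-- all come for free from the StateT and [] instances.
type Parser = StateT String []

runParser :: Parser a -> String -> [(a,String)]
runParser = runStateT

-- Shift the first character of input if it satisfies the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = StateT $ \s -> case s of
  []   -> []
  a:as -> [(a,as) | p a]

char :: Char -> Parser Char
char = satisfy . (==)

alpha, digit, space :: Parser Char
alpha = satisfy isAlpha
digit = satisfy isDigit
space = satisfy isSpace

-- Recognize a literal string, character by character.
string :: String -> Parser String
string = mapM char

-- Recognize a literal string, producing a fixed value.
token :: String -> a -> Parser a
token s a = const a <$> string s

-- Run a parser, insisting on a unique parse that consumes everything
-- except possibly some trailing whitespace.
parseWith :: Parser a -> String -> a
parseWith p s = case [a | (a,t) <- runParser p s, all isSpace t] of
  [a] -> a
  []  -> error "no parse"
  _   -> error "ambiguous parse"

Client code imports this together with Control.Applicative (for <|>, many, some, and optional) and builds parsers like parseBool and parseIntV exactly as before.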

Not a parser, but a powerful module for writing parsers. Try that in C++ or Java ;-).

The sudden appearance of monad transformers here very much has the flavor of a deus ex machina, and you're all very much to be forgiven if you're not only saying to yourself, “I didn't see that coming,” but “Is he really expecting that I'm going to be able to pull this kind of wild rabbit out of my hat?” No. Or at least, not yet. In truth, I gave a predecessor version of this lecture for three consecutive years without invoking this wild magic, only realizing in the summer before the fourth year that not only could the previously mysterious monad transformers be used here, but that they resulted in a Parser type that was both more concise and more powerful. It was a programmer's epiphany: to paraphrase the great Hungarian combinatorialist Paul Erdős, I felt as if I'd been granted a peek at a page of code from God's own git repository.

But that said, you're unlikely to find what you're not looking for. We'll do some more examples that use monad transformers later in the quarter, and will even take a quick peek at some of the wild magic that the Monad Transformer Library (mtl) uses to simplify both the coding and the use of monad transformers.

Historical Note

The use of parser combinators to build backtracking parsers has a long history in functional programming. A particularly formative (and illustrative) article is Functional Parsers (PDF) by Jeroen Fokker. Note that Fokker's code isn't actually Haskell; it's Gofer, a predecessor language, but you shouldn't have any trouble with the translation. Another good source for this material is Chapter 13 of Hutton (2nd edition).