Lecture 18: Introduction to Functional Parsing
A parser is a function/procedure that translates a String
representing a value of type a
into that value. It's tempting to propose the following as the definition of an abstract Parser
type:
type Parser s = String -> s
The problem here is that if we're building a parser out of pieces, the pieces are going to be sub-parsers that consume part but not all of the input. This suggests the following:
type Parser s = String -> (s,String)
where we're returning a pair that consists of the result of parsing an initial segment of the input string, and the unparsed remainder. This may be reminiscent of the State
monad, specialized to String
. Such a definition would allow us to write code like this:
pairp :: Parser a -> Parser b -> Parser (a,b)
pairp ap bp s = ((a,b),u) where
(a,t) = ap s
(b,u) = bp t
or even, if we somehow monadify Parser
along the lines of State
,
pairp :: Parser a -> Parser b -> Parser (a,b)
pairp ap bp = do
a <- ap
b <- bp
return (a,b)
or, after putting our applicative thinking hats on, even
pairp = liftA2 (,)
both of which hint at things to come. Unfortunately, though, this definition of Parser
isn't quite robust enough. Consider, e.g., the special case of trying to parse a simple arithmetic expression. Suppose we had
data Expression
= Const Double
| Add Expression Expression
| Mul Expression Expression
expressionParse :: Parser Expression
What should expressionParse "1+2*3"
return? Obviously, we're looking for the result
(Add (Const 1) (Mul (Const 2) (Const 3)), "")
But what about
(Const 1,"+2*3")
, or(Add (Const 1) (Const 2), "*3")
?
All are plausible, in that they meet the contract for expressionParse
, even though the last one is a bit problematic. We don't just have one right answer, it seems we have three! So let's return a list:
type Parser s = String -> [(s,String)]
Of course, we're going to want to make Parser
an instance of various standard typeclasses, so we can't get away with an unboxed type alias as above, but will anticipate later developments by using newtype
to define a virtually boxed type:
newtype Parser s = Parser { runParser :: String -> [(s,String)] }
The result looks like a mash-up of a State
and a []
.
Following Fokker, we now build some simple parsers, e.g., satisfy
“shifts” the first character of input if it satisfies the argument predicate:
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
[] -> []
a:as
| p a -> [(a,as)]
| otherwise -> []
Note here the spiffy use of guards within patterns within a case statement. There's a really cute way to clean up that the last little bit, which is called the Fokker trick:
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
[] -> []
a:as -> [(a,as) | p a]
How does this work? If p a
is false, we end up returning the []
type's mzero
which is just []
, so [(a,as) | p a]
is either [(a,as)]
or []
, according to whether p a
is True
or False
, respectively. It's a cute trick, and well worth remembering.
We can use satisfy
to define a number of additional simple parsers
char :: Char -> Parser Char
char c = satisfy (c==)
alpha, digit, space :: Parser Char
alpha = satisfy isAlpha
digit = satisfy isDigit
space = satisfy isSpace
*Exercise 18.1 Show by a series of principled transformations that we can define:
char :: Char -> Parser Char
char = satisfy . (==)
The character predicates from Data.Char
all beg to be turned into simple parsers, in similar fashion.
Next, we have a simple parser that recognizes a string:
string :: String -> Parser String
string str = Parser $ \s -> [(t,u) | let (t,u) = splitAt (length str) s, str == t]
Here, again, we use the Fokker trick, creating either a zero- or one-element list. A remarkable fact is that this pretty much concludes our effort to build primitive parsers, further progress is going to take the form of adding Parser
to various standard typeclasses, leveraging the power of these simple functions.
We'll start by considering the task of writing a Parser Bool
. We can start by considering a couple of primitive parsers for recognizing the strings "True"
and "False"
.
parseTrue = string "True"
parseFalse = string "False"
An obvious first problem is that these parsers both have type Parser String
, rather than Parser Bool
. To get a Parser Bool
, it's convenient to make Parser
an instance of Functor
.
instance Functor Parser where
fmap f p = Parser $ \s ->
[(f a,t) | (a,t) <- runParser p s]
With the Functor
instance in hand, we can write:
parseTrue = fmap (const True) (string "True")
parseFalse = fmap (const False) (string "False")
and perhaps recognize a useful pattern here:
token :: String -> a -> Parser a
token s a = const a <$> string s
parseTrue, parseFalse :: Parser Bool
parseTrue = token "True" True
parseFalse = token "False" False
With this,
> runParser parseTrue "True"
[(True,"")]
which isn't exactly what we want, but it's a solid step in the right direction.
Of course, we don't want to be able to parse "True"
or parse "False"
, we want to be able to parse a string that contains either. Somehow, we want to combine the two parsers into one parser, merging their outputs. There are a number of ways to approach this. We could make Parser s
an instance of Monoid
, or we could make Parser
an instance of Alternative
. Indeed, we could do both. There is an apparent cost associated with making Parser
an instance of Alternative
, and that is that Alternative
requires Applicative
. But a moment's consideration should suggest that we're going to want to combine the outputs of multiple parsers, as hinted at with the pairp
example, and we know that the easiest way to provide generalizations of fmap
to functions of greater arity is through Applicative
anyway. So...
instance Applicative Parser where
pure a = Parser $ \s -> [(a,s)]
af <*> aa = Parser $ \s ->
[ (f a,u)
| (f,t) <- runParser af s
, (a,u) <- runParser aa t
]
instance Alternative Parser where
empty = Parser $ \s -> []
p1 <|> p2 = Parser $ \s ->
runParser p1 s ++ runParser p2 s
With this in hand, not only does our pairp
example work, we can finish the Parser Bool
:
parseBool :: Parser Bool
parseBool = token "True" True <|> token "False" False
An alternative solution, well worth remembering in other contexts, is
parseBool = read <$> (string "True" <|> string "False")
At this point, it's standard practice to introduce parser combinators many
and some
, which given a Parser a
return a Parser [a]
, where the many
version returns a list of zero or more successful parses, and some
a list of one or more parses. It's natural to write these via mutual recursion, as
many, some :: Parser a -> Parser [a]
many p = some p <|> pure []
some p = liftA2 (:) p (many p)
Much to our surprise, attempting to do so results in an error: many
and some
are already defined in Control.Applicative
, with the type signature
many, some :: Alternative f => f a -> f [a]
Moreover, a quick consideration of our implementation reveals that it only uses functions in the Alternative
or Applicative
typeclasses. Unsurprisingly, we've just reimplemented these standard functions! Likewise, it's often useful to have an optional
parser combinator, which is used for elements that may be present, but it too is already available in Control.Applicative
, simplifying our work considerably.
For example:
> runParser (many parseBool) "TrueTrue!"
[([True,True],"!"),([True],"True!"),([],"TrueTrue!")]
> runParser (some parseBool) "TrueTrue!"
[([True,True],"!"),([True],"True!")]
> runParser (optional parseBool) "TrueTrue!"
[(Just True,"True!"),(Nothing,"TrueTrue!")]
We can now consider a mildly non-trivial parsing problem. Consider the type
data IntV = IntV Int deriving (Show)
This is just an Int
in a box. Can we parse this given its natural syntax?
parseInt = read <$> some digit
skipSpaces = const () <$> many space
parseIntV = liftA3 (\_ _ i -> IntV i) (string "IntV") skipSpaces parseInt
For example:
> runParser parseIntV "123!"
[]
> runParser parseIntV "IntV 123!"
[(IntV 123,"!"),(IntV 12,"3!"),(IntV 1,"23!")]
A nice consequence of writing parseIntV
is that it enables us to add IntV
to the Read
class, via
instance Read IntV where
readsPrec _ = runParser parseIntV
Which we can test with
> read "IntV 18" :: IntV
IntV 18
Sweet! Of course, we could have also achieved the ability to read IntV
values like this by including Read
in the deriving
clause of the type definition, but explicitly defining the Read IntV
instance allows us to choose a different syntax if desired. Note that the wildcard argument to readsPrec
is a precedence level, which means that we can have different parsers associated with different parsing contexts, a capability that's not often used.
There's a small gap here, in that the parsers we're building return multiple results, whereas read
returns only a single result. The bridge is in terms of code that filters the result list for those pairs (a,t)
where the unparsed string t
consists only of whitespace, and then requires that there be only a single result of that form, e.g.,
parseWith :: Parser a -> String -> a
parseWith p s = case [a | (a,t) <- runParser p s, all isSpace t] of
[a] -> a
[] -> error "no parse"
_ -> error "ambiguous parse"
It's reasonable to argue that this might have been better handled by returning a value of type Either String a
than by throwing an exception via error
. When you write your own language, be sure to do it that way.
Monadic Parsing
Historically, functional programmers approached parser combinators through Monad
rather than through
Applicative
and friends. It's worth noting that we can make Parser
into a monad, and this has the considerable advantage of making monadic do
-notation available to us, although pending changes to GHC/Hackage will permit the use of restricted versions of do
with Applicatives, largely eliminating this advantage.
instance Monad Parser where
return a = Parser $ \s -> [(a,s)]
p >>= g = Parser $ \s ->
[ (b,u) | (a,t) <- runParser p s
, (b,u) <- runParser (g a) t ]
instance MonadPlus Parser where
mzero = empty
mplus = (<|>)
With these definitions, we could have written
parseIntV :: Parser IntV
parseIntV = do
string "IntV"
skipSpaces
i <- parseInt
return $ IntV i
which has the advantage of conceptual simplicity. That said, contemporary practice favors using Applicative
based constructors over Monad
based constructors, and the results argue for themselves in terms of concision if not always clarity.
*Exercise 18.2 Consider the following type declaration:
data ComplexInt = ComplexInt Int Int
deriving (Show)
Use Parser
to implement Read ComplexInt
, where you can accept either the simple integer syntax "12"
for ComplexInt 12 0
or "(1,2)"
for ComplexInt 1 2
, and illustrate that read
works as expected (when its return type is specialized appropriately) on these examples. Don't worry (yet!) about the possibility of minus signs in the specification of natural numbers.
This is cool stuff, and we can and will take this style of parsing very far. This isn't our father's Fortran, nor indeed our high school Java or Scheme.
With all of this work behind us, the end product is remarkably concise and powerful parser combinator library that requires less than a full page of code. Just to be clear here: we haven't written a parser in less than a full page of code, we've written a module for writing parsers in less than a full page of code. And we're not done yet.
- Parser.hs, as above.
Ἀπὸ μηχανῆς θεός
Remember back when we remarked that original definition of Parser
looked like a mashup of a State
monad and a []
monad? We can breathe life into this observation by noting that Control.Monad.State
ultimately defines State
in terms of the monad transformer StateT
. Looking at the definition of StateT
, we have
newtype StateT s m a = StateT { runStateT :: s -> m (a,s) }
So, if we define
type Parser = StateT String []
we at least get the type right. But here's where the wild magic begins: if we look at the instances provided by StateT
, we'll see:
Functor m => Functor (StateT s m)
(Functor m, Monad m) => Applicative (StateT s m)
(Functor m, MonadPlus m) => Alternative (StateT s m)
Monad m => Monad (StateT s m)
MonadPlus m => MonadPlus (StateT s m)
In the current context, m
is []
, and we know that []
is a Functor
, a Monad
, and a MonadPlus
, and as such, our Parser
type will be an instance of Functor
, Applicative
, Alternative
, Monad
, and MonadPlus
for free! Moreover, once we have these instances, we can use combinators for these typeclasses to simplify our base definitions, e.g.,
string :: String -> Parser String
string = mapM char
Pulling this all together, we have
- Parser.hs, via Monad Transformers, in 22 lines.
Try that in C++ or Java ;-).
The sudden appearance of monad transformers here very much has the flavor of a deus ex machina, and you're all very much to be forgiven if you're not only saying to yourself, “I didn't see that coming,” but “Is he really expecting that I'm going to be able to pull this kind of wild rabbit out of my hat?” No. Or at least, not yet. In truth, I gave a predecessor version of this lecture for three consecutive years without invoking this wild magic, only realizing in the summer before the fourth year that not only could the previously mysterious monad transformers be used here, but they resulted in a Parser
class that was both more concise and more powerful. It was a programmer's epiphany—to paraphrase the great Hungarian combinatorialist Paul Erdös, I felt as if I'd been granted a peak at a page of code from God's own repository.
But that said, you're unlikely to find what you're not looking for. We'll do some more examples that use monad transformers later in the quarter, and will even take a quick peek at some of the wild magic that the Monad Transformer Library (mlt) uses to simplify both the coding and the use of monad transformers.
Historical Note
The use of parser combinators to build backtracking parsers has a long history in functional programming. A particularly formative (and illustrative) article is Functional Parsers (.ps) by Joroen Fokker. Note that Fokker's code isn't actually Haskell—it's Gofer, a predecessor language. You shouldn't have any trouble with translation. Another good source for this material is Chapter 8 of Hutton.