Functional Parsing

We've seen several example instances of Show (converting a value to String). Now we'll start implementing functions that implement Read values from Strings.

With calculator labs, used lots of string- (or list-) manipulating functions. In general, want to build parsers for complicated syntactic grammars in more principled ways. A lot of times, need to do much more backtracking.

As you'll see in most compilers courses, the the traditional way to write a parser is to break it up into lexing (which breaks a string into a stream of tokens) and parsing (which consumes tokens according to some language grammar). Domain-specific languages for lexers and parsers are often used to generate code that integrates with the rest of the compiler implementation (e.g. type checking, evaluation and/or code generation).

In contrast, we'll see an approach to parsing that builds and composes the kinds of functional programming building blocks we have been learning.

Parser

First cut at Parser type, like State:

type Parser a = String -> (a, String)

However, we'll often need to return multiple possible parses:

type Parser a = String -> [(a, String)]

And, as we've done many times before, create a new type...

data Parser a = Parser (String -> [(a, String)])

... without the run-time overhead.

newtype Parser a = Parser (String -> [(a, String)])

And we'll use records to give a name for unwrapping the function inside:

newtype Parser a = Parser { runParser :: String -> [(a, String)] }

So, we've got a "combination" of State and List. We'll talk more about this another day.

Parsing Characters

char :: Char -> Parser Char
char c = Parser $ \s -> case s of
    []     -> []
    (a:as) -> if c == a
                 then [(a,as)]
                 else []

> runParser (char 'C') "Cool"
> runParser (char 'C') "Not Cool"

Clean this up with a cute pattern called Fokker's trick:

char c = Parser $ \s -> case s of
    []     -> []
    (a:as) -> [(a,as) | c == a]

Exercise: Expand out the list comprehension in terms of list-manipulating functions.

We can make this work more generally:

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
    []     -> []
    (a:as) -> [(a,as) | p a]

char c = satisfy (c==)
alpha  = satisfy isAlpha
digit  = satisfy isDigit
space  = satisfy isSpace

> runParser alpha "Hello!"
> runParser alpha "!Hello!"

Parsing Strings

Okay, now let's match against sequences of Chars (i.e. Strings).

string :: String -> Parser String
string str = Parser $ \s ->
    let (t,u) = splitAt (length str) s in
    if str == t then [(t,u)] else []

Again, use Fokker's trick:

    [(t,u) | let (t,u) = splitAt (length str) s, str == t]

> runParser (parse "Hello") "Hello!"
> runParser (parse "Hello") " Hello!"

Parsing Tokens

A token is a string in the source code that has some meaning. For example, in Haskell, the token True refers to one of the data constructors of the Bool datatype, and the token let marks the beginning of a let-expression.

The token parser defines how to assign meaning to a String token:

token :: String -> a -> Parser a
token str a = Parser $ \s ->
    [(a,u) | (t,u) <- runParser (string str) s]

parseTrue  = token "True" True
parseFalse = token "False" False

> runParser parseTrue "True"
> runParser parseTrue "False"

Parser Combinators

We can now easily combine smaller parsers into bigger ones using the Functor, Applicative, and Monad interfaces.

instance Functor Parser where
 -- fmap :: (a -> b) -> Parser a -> Parser b
    fmap g (Parser f) = Parser $ \s -> [(g x, t) | (x,t) <- f s]
    fmap g f = Parser $ \s -> [(g x, t) | (x,t) <- runParser f s]

We can now rewrite token more succinctly:

token str tok = fmap (\_ -> tok) (string str)

Or, using const :: a -> b -> a:

token str tok = fmap (const tok) (string str)

Or:

token str tok = const tok <$> string str

So far, we can parse True or parse False. Want to be able to combine parsers into one for Booleans. Enter monoids (in particular, Alternative). First need to define Applicative.

instance Applicative Parser where
 -- pure :: a -> Parser a
    pure a = Parser $ \s -> [(a,s)]

 -- (<*>) :: Parser (a -> b) -> Parser a -> Parser b
    pab <*> pa = Parser $ \s ->
      [ (f x, u) | (f,t) <- runParser pab s
                 , (x,u) <- runParser pa t ]

instance Alternative Parser where
 -- empty :: Parser a
    empty = Parser $ \s -> []

 -- (<|>) :: Parser a -> Parser a -> Parser a
    (Parser f) <|> (Parser g) = Parser $ \s -> f s ++ g s

Now, behold:

parseBool :: Parser Bool
parseBool = parseTrue <|> parseFalse

> runParser parseBool "TrueFalse"
> runParser parseBool "FalseTrue"

Cool!

Useful: one-or-more matches, or zero-or-more, or zero-or-one. Control.Applicative defines the following (some and many are a bit clever, using pure and (<|>)):

some :: Alternative f => f a -> f [a]
many :: Alternative f => f a -> f [a]
optional :: Alternative f => f a -> f (Maybe a)

> runParser (some parseBool) "TrueTrue!"
> runParser (many parseBool) "TrueTrue!"
> runParser (optional parseBool) "TrueTrue!"

The Alternative class does buy us something!

And then:

instance Monad Parser where
 -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b
    pa >>= f = Parser $ \s ->
      [ (y,u) | (x,t) <- runParser pa s
              , (y,u) <- runParser (f x) t ]

Example with do-notation (even though only the Applicative interface is needed):

parseThreeBools :: Parser (Bool, Bool, Bool)
parseThreeBools = do
  b1 <- parseBool
  b2 <- parseBool
  b3 <- parseBool
  return (b1, b2, b3)

Unambiguous Parse

Compared to read, one can define a version that deals with errors explicitly rather than crashing:

maybeRead :: Parser a -> String -> Either String a
maybeRead p s =
    case runParser p s of
      [(a,"")]     -> Right a
      [(a,suffix)] -> Left $ "suffix not consumed: [" ++ suffix ++ "]"
      []           -> Left $ "no successful parses"
      _            -> Left $ "multiple successful parses"

Source Files