Lecture 18: Practical Functional Parsing

`Text.ParserCombinators.ReadP`

Having gone to all of the trouble of writing our own parser combinator library, we'll now set it aside and use a similar but much more developed and efficient combinator library that comes with the Haskell Platform: Text.ParserCombinators.ReadP. Note that there are several other parser libraries for Haskell, and that ReadP isn't necessarily the most popular, it's just the best pedagogical fit for us right now.

First, the good news. May familiar parser combinators from Parser are available to us in ReadP, notably, satisfy, char, and string, as well as many other conveniences. ReadP also belongs to all of the type classes that you'd expect: Functor, Applicative, Alternative, Monad, MonadFail, and MonadPlus.

The bad news is that when ReadP was written, functional parsing was usually referred to as monadic parsing, and many of the other type classes hadn't yet been formalized. So there is an occasional awkwardness, in that ReadP in some ways anticipated Applicative, Alternative and other type classes that didn't exist when it was written, and so in some ways is duplicative of them. In particular, ReadP defines many with the same meaning as Applicative's many, but conflicting with it; and many1 which is synonymous with Applicative's some. ReadP defines (+++) which is synonymous with <|>, while pfail is synonymous with empty.

A mildly annoying difference is that the “run” function is readP_to_S rather than runParser. If it's sufficiently annoying, just define runParser = readP_to_S, otherwise live with it. We'll live with it.

ReadP was written with a concern for efficiency, and this leads us to consider a couple of new parser combinators.


    (<++) :: ReadP a -> ReadP a -> ReadP a

is described as a left-biased alternative. The idea is that pa <++ pb is a parser that tries pa first. If it returns any results, those are the results, but if it fails, it next tries pb. We can imagine that (<++) is defined as


    ap <++ bp = Parser $ \s ->
        case runParser ap s of
            [] -> runParser ab s
            rs -> rs

although the actual definition is quite different, as the underlying representations are more complex. Next, we have:


    munch :: (Char -> Bool) -> ReadP String

This function returns the largest substring of the string to be parsed that satisfy the predicate. This is subtly different from many (satisfy p), cf.


    > readP_to_S (munch (=='a')) "aaabb"
    [("aaa","bb")]
    > readP_to_S (many $ satisfy (=='a')) "aaabb"
    [("","aaabb"),("a","aabb"),("aa","abb"),("aaa","bb")]

This often doesn't make any difference at all (as only the longest version can be valid), but there can be a quadratic savings in not generating the shorter forms.

Note that there's a variant munch1 of munch which succeeds only if there is at least one character in the result.

Becoming an Expert

The module Text.ParserCombinators.ReadP contains many functions beyond those that we implemented in our Parser module. A good strategy for building expertise in Haskell is to read through the documentation for modules like Text.ParserCombinators.ReadP, and try to implement the various functions it contains. Then, follow the links to the source code, and see how Haskell experts have implemented them. This will give you practice, the opportunity to read and learn from experts, and also a close acquaintance with the facilities the module provides.

Example: Duplication With Variation

It's easy to duplicate a file, that's what the Unix cp utility is for. But what if we want to produce a bunch of near-duplicates of a given text, i.e., we want those duplicates to vary somehow? If the variations are sufficiently simple, we can write a program that generates all of the variations, and produces the duplicated text. But often, we'll have some sort of database, and the variations will amount to the rows of one of the relations of that database. In this case, we'll want to be able to handle a file that contains our data, and CSV (comma separated values) with headers is a natural choice.

If we're only interested in duplicating a particular fixed base text, we incorporate that text into the logic of the program we're writing. But this is the sort of job where the text to be duplicated tends to vary over time too, and this leads to a preference to “move it out of the code.” So we'll create a simple text format to describe the underlying text.

As an aside, the code we're studying is a simplified version of the spam program that Professor Kurtz has used to send out wait-list notifications and similar bulk emailings.

`DupV.Template`

Our templates will consist of ordinary text files, in which set-braces are used to indicate a placeholder to be filled in from our data file. For example, one of the test files for this program contains:


    The Arabic numeral {arabic} and the Roman numeral {roman} both represent {english}.

The idea here is that {arabic}, {roman}, and {english} will all be filled in with data from our CSV file.

Our internal representation of a Template will be:


    data Template = Template { items :: [TemplateItem] }
        
    data TemplateItem 
        = Literal String
        | Variable String

Our first programming task is to parse the input file, e.g., obtaining


    Template {items =
        [ Literal "The Arabic numeral "
        , Variable "arabic"
        , Literal " and the Roman numeral "
        , Variable "roman"
        , Literal " both represent "
        , Variable "english"
        , Literal ".\n"
        ]}

We'll use ReadP:


    parseTemplate :: ReadP Template
    parseTemplate = Template <$> many parseTemplateItem where
        parseTemplateItem = parseLiteral <++ parseVariable
        parseVariable = Variable <$>
            (char '{' *> munch1 isAlphaNum <* char '}')
        parseLiteral  = Literal <$> munch1 (`notElem` "{}")

This requires a bit of explanation. We use many to parse a list of items. The items to be parsed are TemplateItems, which come in one of two forms. We write sub-parser for each, using (<++) to avoid a parse that must fail if the preceding parse succeeded. There's some subtly in both subparsers.

Our use of munch1 in parseLiteral is important! If we just used munch, the parser would succeed in producing a Literal "" without reading any input, and in the context of many would result in infinitely many Literal ""'s at the end of the string (as well as at each transition to a variable). Pragmatists will note that parsers should always consume some input, otherwise bad things can happen!

Our definition of parseVariable uses two new applicative operators, <* and *>. These are sequencing operators, which return the value of the first, or second, argument respectively. Note that the relational operator “points towards” the applicative value we will keep. In the old days, we'd have written parseVariable in a monadic style, which is superficially quite different, but essentially the same:


    parseVariable = Variable $ do
        char '"'
        result <- munch1 isAlphaNum
        char '"'
        pure result

It is perfectly reasonable to write parsers this way when you're trying to figure things out, but it's useful to keep in mind that if a bound variable never appears on the right-side of a binding, it should be possible to rewrite the code applicatively. With practice, the applicative forms come first.

We can use the ReadP parser to make Template an instance of the Read type class,


    instance Read Template where
        readsPrec _ = readP_to_S parseTemplate

and then introduce the convenience function


    loadTemplateFile :: FilePath -> IO Template
    loadTemplateFile path = read <$> readFile path

Finally, we have a simple bit of code for instantiating a template, given a list of keys and a list of values:


    instantiate :: Template -> [String] -> [String] -> String
    instantiate template header record =
        let dict = Map.fromList $ zip header record
            fill (Literal s) = s
            fill (Variable v) = dict ! v
        in concat . map fill . items $ template

This is simple, yet powerful code. We build a Map, an efficient structure for manipulating key-value pairs, out of an association list built out of the keys and values. We define a fill :: TemplateItem -> String that “evaluates” a TemplateItem in the context of the Map we just built. Finally, we map the fill function across the items of the template, obtaining a [] of String, which we flatten using concat.

`DupV.SimpleCSV`

The CSV format is deceptively simple, and parsing any individual CSV file is usually straightforward. A CSV file consists of a sequence of newline terminated records, where each record is a sequence of comma separated fields, and each field is just a String. So, what's hard? CSV began life as an informal format. There is an Internet Engineering Task Force (IETF) Received for Comment (RFC, a.k.a., and internet standards document) RFC 4180 that describes CSV, but that “standard” is based on reverse engineering files that claimed to be CSV, so the cart of practice came before the horse of specification. And my experience of CSV includes files that don't meet the “standard” of RFC 4180, a very real caveat for anyone who emptor's it. So writing a good, general CSV parser has real challenges.

We'll start by writing a simplified ReadP parser, and then deal with some of the complexities of CSV.


    newtype CSV = CSV { content :: [[String]] }

    instance Read CSV where
        readsPrec _ = readP_to_S parseSimpleCSV

    parseSimpleCSV :: ReadP CSV
    parseSimpleCSV = CSV <$> record `endBy` newline <* eof where
        newline = string "\n"
        record = field `sepBy` char ','
        field =  munch (`notElem` ",\n")

There's a bit of complexity here.

As before, we're going to take advantage of the Read type class, and so introduce use newtype to wrap the type we're interested in. But read can ignore white space at the end of its input, and in the case of CSV parsing, this can result in an ambiguity, so we use the eof parser from ReadP to ensure that the entire input string is read.

The endBy parser combinator builds lists of values from a parser for the values and their terminators. There is a similar sepBy parser combinator for building lists of values from a parser for the values and a parser for the separators. The expression record `endBy` newline is a parser for a list of records, each of which must be terminated by a newline, while field `sepBy` char ',' is a parser for a list of fields separated by commas.

This would be good enough, if

we were only going to run this code on Unix-like systems, and
we never needed a comma or a newline within a field value.

Unfortunately, these are not assumptions we want to make. So there's a bit of work to do.

Newlines

Operating systems don't agree as to what precise sequence of characters constitutes a newline. Unix uses a bare linefeed (LF) "\n", a.k.a. ASCII 012 (that's an octal codepoint). MacOS Classic used a bare carriage return (CR) "\r", a.k.a., ASCII 015. MacOS X follows the Unix convention as befits its Unix foundations. Finally, Windows follows a convention that is as old as teletype machines, and relies on a CRLF pair, "\r\n". And CSV is such a basic file format that it could have come from anywhere, and may have passed through many hands, so we can't even be entirely confident that the same newline convention will be used consistently within a single file, although we expect that exceptions to this will be rare.

We'll take a pragmatic point of view. We may have to deal with Windows, Linux, and MacOS X, but we're not going to have to deal with MacOS Classic. So we can define


    newline = string "\n" <++ string "\r\n"

Yes, this is biased towards Unix. I like it that way.

Commas

To deal with the issue of commas and newlines within fields, CSV has the notion of a quoted field: this is a field that begins and ends with a '"', and can contain anything but a '"' within it. To avoid any ambiguity, CSV forbids simple (non-quoted) fields from containing a '"'. Thus


    field = quotedField <++ simpleField
    simpleField = munch (`notElem` ",\n\"")
    quotedField = char '"' *> munch (/='"') <* char '"'

Of course, this just trades one problem for another. We can now have commas in our fields, but we can't have double-quotes. To deal with this, CSV further allows a quoted double-quote within a quoted field. How do you quote a double-quote? In CSV, you repeat it. Thus, e.g.,


    "He said, ""Foo!"""

To accommodate this, we'll replace the munch (/='"') within the definition of quotedField as follows:


    quotedField = char '"' *> many quoteChar <* char '"'
    quoteChar = satisfy (/= '"') <++ (string "\"\"" $> '"')

Note the useful $> operator from Data.Functor. The expression (string "\"\"" $> '"') means that if we successfully parse a repeated double-quote, the value we should return is a single double-quote.

*Exercise 18.1 The use of many in the definition of quotedField introduces the very problem that munch was written to solve. The problem is that we can't use munch, as it has the wrong type (it acts on a character predicate, rather than general parser). Write the function


    greedy :: ReadP a -> ReadP [a]

which greedily parses a sequence of values, returning only maximal sequences. Note that "maximal" in this context does not mean that list has the greatest possible length, nor that it ingests the maximum number of input characters, but rather that it cannot be extended. E.g.,


    > readP_to_S (greedy (string "a" +++ string "ab")) "abaac"
    [(["a"],"baac"),(["ab","a","a"],"c")]

Hint: the (<++) parser combinator is very helpful!

Main

Our main is


    main :: IO ()
    main = do
        args <- getArgs
        case args of
            [variationFile,templateFile] -> do
                template <- loadTemplateFile templateFile
                header:variations <- content <$> loadSimpleCSV variationFile
                putStr . concat . (`map` variations) $ \variation ->
                    instantiate template header variation
            _ -> do
                hPutStrLn stderr "Usage: spam variationFile templateFile"
                exitFailure

This is mostly straightforward.

Real Programming

There is a good overview of this process on the HaskellWiki at haskell.org, How to write a Haskell program. The name notwithstanding, this wiki page describes how to package and distribute a Haskell program or library.

I recommend using git for version control, cabal for builds, and haddock for documentation.

Real programs often try to move as much of their data into data files as possible (we've done this), and as much of their internal logic into modules organized around basic data structures an the algorithms that support them. This simplifies code reuse.

Real programs also require a means of distribution. For this particular program, the sources are on GitHub: stuartkurtz/DupV, and the program can easily be downloaded and installed from there.

Exercise 18.2 One way this program can be made much more useful is to add a number of pre-defined variables, e.g., the time and date of processing. Modify the dupv program to include such a feature.

CMSC-16100

Honors Introduction to Programming, I

Autumn Quarter, 2019