Lecture 19: Practical Functional Parsing
Having gone to all of the trouble of writing our own parser combinators, we'll now set that aside and use a similar but much more developed and efficient combinator library that comes with the Haskell Platform: Text.ParserCombinators.ReadP. Note that there are several other parser libraries for Haskell, and that ReadP isn't necessarily the most popular; it's just the best pedagogical fit for us right now. We can pretty much use ReadP as a plug-in replacement for Parser, but there are a few things we should change to make our use of it more efficient:
- readP_to_S replaces runParser.
- (+++) is an alternative to (<|>), but more significantly, (<++) can be used for left-biased sum, where the second parser is used only if the first parser fails.
- Text.ParserCombinators.ReadP has its own version of many, which can conflict with Control.Applicative's. ReadP also defines many1, which is just some for ReadP.
- ReadP has a combinator munch, which reads a sequence of characters greedily.
For example, if we assume
runParser :: ReadP a -> ReadS a
runParser = readP_to_S
token :: String -> a -> ReadP a
token str a = fmap (const a) (string str)
we can contrast ordinary with left-biased alternation by
> runParser (token "a" 'a' +++ token "a" 'b') "a"
[('a',""),('b',"")]
> runParser (token "a" 'a' <++ token "a" 'b') "a"
[('a',"")]
Likewise, we can illustrate the greedy nature of munch, as compared with many, by contrasting
> runParser (R.many . satisfy $ isAlpha) "abc123"
[("","abc123"),("a","bc123"),("ab","c123"),("abc","123")]
with
> runParser (munch isAlpha) "abc123"
[("abc","123")]
Note that we used import Text.ParserCombinators.ReadP as R (and import Control.Applicative as A) to allow us to work around the conflict over many, and also that munch and many, while often thought of together, take very different types of arguments (a character predicate for munch, an arbitrary parser for many).
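Concretely, the import header assumed by the examples above would look something like the following sketch (the Data.Char import is my assumption, needed only for isAlpha in the example):

import Text.ParserCombinators.ReadP as R   -- many, munch, satisfy, (+++), (<++), ...
import Control.Applicative as A            -- also exports many, hence the conflict
import Data.Char (isAlpha)                 -- assumed for the example above

With both modules in scope under an alias, the ambiguous name can be written explicitly as R.many or A.many.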
*Exercise 19.1 Extend Parser.hs to include (<++).
*Exercise 19.2 Consider the following type declaration:
data ComplexInt = ComplexInt Int Int
    deriving (Show)
Use Text.ParserCombinators.ReadP to implement Read ComplexInt, where you can accept either the syntax "12" for ComplexInt 12 0 or "(1,-2)" for ComplexInt 1 (-2). This time, as the example indicates, I also want you to worry about the possibility of a leading minus sign before an integer. Hint: consider using the option parser combinator from Text.ParserCombinators.ReadP.
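By way of illustration (and not as a solution to the exercise), option d p behaves like p, but can also succeed with the default value d while consuming no input. A minimal sketch, with an invented parser name:

optionalColon :: ReadP Char
optionalColon = option ' ' (char ':')

> runParser optionalColon "abc"
[(' ',"abc")]

On an input that does begin with a colon, both the consuming parse and the default parse are produced; a left-biased variant can be written as p <++ return d.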
Becoming an Expert
The module Text.ParserCombinators.ReadP contains many functions beyond those that we implemented in our Parser module. A good strategy for building expertise in Haskell is to read through the documentation for modules like Text.ParserCombinators.ReadP, and try to implement the various functions it contains. Then, follow the links to the source code, and see how Haskell experts have implemented them. This will give you practice, the opportunity to read and learn from experts, and also a close acquaintance with the facilities the module provides.
Let's consider a parsing task that seems simple at first, but which is really a good deal more difficult than it might appear: parsing comma separated values (CSV).
The format is deceptively simple, and parsing any individual CSV file is usually straightforward. A CSV file consists of a sequence of newline-terminated records, where each record is a sequence of comma-separated fields, and each field is just a String. So, what's hard? CSV began life as an informal format. There is an Internet Engineering Task Force (IETF) Request for Comments (RFC, i.e., an internet standards document), RFC 4180, that describes CSV, but that “standard” is based on reverse engineering files that claimed to be CSV, so the cart of practice came before the horse of specification. And my experience of CSV includes files that don't meet the “standard” of RFC 4180, a very real caveat emptor for anyone relying on it. So writing a good, general CSV parser has real challenges.
There are (at least) three practical issues that intervene here.
- Operating systems don't agree as to what precise sequence of characters constitutes a newline. Unix uses a bare linefeed (LF) "\n", a.k.a. ASCII 012 (that's an octal codepoint). MacOS Classic used a bare carriage return (CR) "\r", a.k.a., ASCII 015. Modern MacOS X follows the Unix convention as befits its Unix foundations, although this cannot be completely relied on as legacy Classic files persist. Finally, Windows follows a convention that is as old as teletype machines, and relies on a CRLF pair, "\r\n". And CSV is such a basic file format that the source could have come from anywhere, and may have passed through many hands, so we can't even be entirely confident that the same newline convention will be used consistently within a single file, although we expect that exceptions to this will be rare.
- The second problem is that there are two types of fields: unquoted fields and quoted fields. Unquoted fields can contain any characters except commas, double quotes, CR or LF. Quoted fields are contained in double quotes (with no spaces before or after), and can contain any character, except that a double quote must be quoted—which is done by duplicating it.
- Finally, as the RFC documents, the trailing newline may be omitted. This reflects a lack of consensus on the question as to whether newlines are record terminators, as claimed above, or record separators. This lack of consensus represents a serious challenge for CSV parsers, as we'll soon see.
For the sake of the discussion below, we'll say that a CSV record is trivial if it consists of a single, empty field. Trivial records are going to be the bane of our efforts to produce a truly general-purpose CSV parser, because ambiguities in interpretation often boil down to the question of whether a trivial record was intended at some point in the record sequence. It's tempting to just suppress all trivial records, thereby eliminating the intrinsic ambiguity in CSV, but it isn't really possible to rule out the possibility that the CSV author intended to include trivial records, e.g., to use a single CSV file to represent several distinct relations, segmented by trivial records.
With all this on our minds, let's get started. It's easiest to build the parser from the bottom up. First, let's deal with that pesky newline issue:
import Text.ParserCombinators.ReadP
newline :: ReadP String
newline = string "\n" <++ string "\r\n" <++ string "\r"
This is very intentional code. It is essential that we put the string "\r\n" before the string "\r", and that we use the short-circuiting (<++) between them; otherwise, we'll recognize the CRLF sequence as consisting of two newlines, which would cause us to conjure into being a trivial record between the two characters of what's supposed to be an indivisible sequence representing a newline. And while there exists a tenuous theoretical possibility that a file may have been edited by hand first on a MacOS Classic system, and then subsequently on a Unix system, in such a way that a trivial record actually was intended, it's hard to imagine this actually happening in practice. After all, the people who edit/produce CSV have to live with themselves and their files. The far, far, far more likely explanation for a CRLF pair is that the CSV file was produced on a Windows system, and that it is intended to represent a single newline.
A further note is that we're combining parsers using ReadP's left-biased sum (<++), in which a successful parse by the left-hand summand precludes any attempt to parse using the right-hand summand. This short-circuiting asymmetry creates a performance incentive for putting the parser that's more likely to succeed first, or to put this slightly differently, we pay a performance penalty for putting a parser that's going to fail before a parser that's going to succeed, because if the successful parser runs first, (<++) won't bother to waste any computational effort in making the failing parser fail. I run on Unix systems, so I want the Unix-favoring alternative first. It's good to be the maintainer.
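As a quick sanity check, a CRLF pair should now be consumed as a single newline, and a bare LF should still work; in GHCi I'd expect something like:

*Main> readP_to_S newline "\r\nfoo"
[("\r\n","foo")]
*Main> readP_to_S newline "\nfoo"
[("\n","foo")]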
Next, we'll deal with fields. First, simple, unquoted fields:
simpleField :: ReadP String
simpleField = munch (`notElem` ",\"\n\r")
The parser combinator munch is greedy, i.e., it will read as many characters as possible that satisfy the predicate, and so it can be significantly more efficient to use munch p than many $ satisfy p. In this case, we're going to read all of the characters that are permitted in a simple field, i.e., everything but a comma, a double quote, an LF, or a CR.
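Because munch is greedy and deterministic, it stops exactly at the first excluded character and produces a single parse; I'd expect a check along these lines:

*Main> readP_to_S simpleField "hello world,42"
[("hello world",",42")]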
Next up, quoted fields, which must begin and end with double quotes...
nextChar :: ReadP Char
nextChar = satisfy (/='"') <++ fmap (const '"') (string "\"\"")
quotedField :: ReadP String
quotedField = do
    char '"'
    result <- many nextChar
    char '"'
    return result
The nextChar parser returns the next character to be parsed, collapsing doubled double quotes into a single double quote. The quotedField function reads a double quote, a run of characters produced by nextChar, and then a terminal double quote. Now it happens that this particular pattern—read something, read and remember something, read something, then return the remembered thing—happens a lot in parsing (think parenthesized expressions). So ReadP has a function between :: ReadP open -> ReadP close -> ReadP a -> ReadP a that captures this pattern, allowing us to simplify the definition of quotedField to
quotedField :: ReadP String
quotedField = between (char '"') (char '"') (many nextChar)
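A quick check on an unambiguous input (one with no embedded quotes) should look something like:

*Main> readP_to_S quotedField "\"hello, world\""
[("hello, world","")]

Note that a comma is perfectly acceptable inside a quoted field.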
We can now combine these two specialized field parsers into a general field parser:
field :: ReadP String
field = simpleField <++ quotedField
With any luck, we'll do enough testing to ensure coverage of the code we've written, because there's a bad bug lurking here:
*Main> readP_to_S field "foobar foobar foobar"
[("foobar foobar foobar","")]
*Main> readP_to_S field "\"foobar foobar foobar\""
[("","\"foobar foobar foobar\"")]
Ouch! We weren't expecting that! The second example, consisting of a quoted field, got parsed as an empty field, and we somehow fail to recognize the presence of a quoted field. Why? It's because the parser simpleField succeeded, despite our expectations, returning a successful parse of the empty text field after consuming no characters. And if we look at the code for simpleField, it's easy to see that it can do just that. How can we fix this? A first reasonable question is to ask whether or not it makes sense for a simple field to be empty. My experience with CSV is yes, indeed, this often happens, e.g., if you write out a region of a spreadsheet as CSV, blank cells will be written out as empty fields. But we can use the information available to us a bit more thoroughly. After we've read a field (and it doesn't matter whether it's a simple field or a quoted field), we should either be at the end of the file, or the next character should be either a comma or the beginning of a newline. So what we'll do is to write a little wrapper:
import Control.Monad
complete :: ReadP a -> ReadP a
complete p = do
    result <- p
    peek <- look
    guard $ null peek || head peek `elem` ",\r\n"
    return result
field :: ReadP String
field = complete simpleField <++ complete quotedField
The look function returns the entire unparsed input; it's the moral equivalent of State's get. I've seen many students over the years misuse look (e.g., by using it to choose which parser to call next), but rather than ignoring it and hoping you won't notice it (which hasn't worked for me yet), let me take this opportunity to show you a proper use. Here, we're using look to make sure that after reading a field (whether simple or quoted), we're either at the end of the file, at the end of the record, or at the end of the field. The effect is to make a parser fail earlier on an unproductive alternative, which is a good thing.
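With the complete wrapper in place, both of the earlier examples should now parse as intended; a quick check in GHCi ought to look something like:

*Main> readP_to_S field "foobar foobar foobar"
[("foobar foobar foobar","")]
*Main> readP_to_S field "\"foobar foobar foobar\""
[("foobar foobar foobar","")]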
With this, it's natural to use the description of a record (a sequence of fields separated by commas) to write the record parser:
record :: ReadP [String]
record = field `sepBy` char ','
Here, I have the advantage of a little experience. There are two versions of sepBy: the ordinary, unadorned sepBy, and a variant sepBy1. The distinction is that sepBy1 will never return the empty list. Is it reasonable to have an empty record? I suppose, but it would have to be indicated by an empty line. But that already represents a record that consists of a single, empty field, and [] and [""] are distinct. So if we allow record to return an empty record, we know it's also going to return a trivial record on the same input, and so there will be an ambiguity, which will prevent a successful parse. So we'll use sepBy1:
record :: ReadP [String]
record = field `sepBy1` char ','
Whew. Are we done with the thinking parts yet? Unfortunately, no. Remember that there's that nagging ambiguity in usage as to whether a CSV file ends with a newline or not? This is essentially the question of whether newline is a record separator or a record terminator. We like to think that it's a terminator, because then an empty file represents no records, and it might be nice to be able to represent the empty relation. Whereas, if newline is a separator, then an empty file must represent a relation containing a single trivial record, [[""]]. So it would be nice to be able to dictate, “terminator.” But we know that both kinds of files exist on the internet, so let's think through the consequences of using a different convention for newlines than the CSV author intended. If we decide that newlines are record terminators, and we get a file that ends with a non-trivial record, then that file will not parse. That's bad, really bad, if we're trying to be general purpose. Whereas, if we go against our better judgment, and interpret newlines as record separators, then if we encounter a file that ends with a newline, we'll end up adding a trivial record to the end. That's not great, but not great beats really bad.
OK, we think, why not just elide a terminal trivial record? Can we ever think of a context in which a terminal trivial record might be meaningful? Sadly, I can. That case where trivial records are relation separators might apply here, in which case a terminal trivial record might indicate the presence of a final, empty, relation. Edge cases are often like this, seeming artificial until there turns out to be no other way to do what you want to do.
So we need to make an engineering decision. My choice here is to go against my dictatorial preference, and interpret newline as a separator, but to reserve a mental note that this has to be documented. Not great beats really bad. The same sort of considerations that were in play with record apply here too, so...
csv :: ReadP [[String]]
csv = record `sepBy1` newline
Now we have a parser. But we want to deliver functions
parseCSV :: String -> [[String]]
readCSV :: FilePath -> IO [[String]]
So we still have some work to do. First, we'll adapt our parseWith function from the last Lecture to handle ReadP-based parsers, which requires only that we replace runParser with readP_to_S (and adjust the type signature to match):
parseWith :: ReadP a -> String -> a
parseWith p s = case [a | (a,t) <- readP_to_S p s, all isSpace t] of
    [a] -> a
    []  -> error "no parse"
    _   -> error "ambiguous parse"
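As a quick sanity check of the adapted parseWith (with isAlpha in scope from Data.Char), I'd expect:

*Main> parseWith (munch isAlpha) "abc"
"abc"

whereas an input with a leftover non-whitespace suffix, like "abc123", should produce the "no parse" error.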
And then we can write parseCSV and readCSV, encapsulating the various parsers we wrote along the way.
parseCSV :: String -> [[String]]
parseCSV = parseWith csv where
    csv = record `sepBy1` newline
    record = field `sepBy1` char ','
    field = complete simpleField <++ complete quotedField
    simpleField = munch (`notElem` ",\"\n\r")
    quotedField = between (char '"') (char '"') (many nextChar)
    nextChar = satisfy (/= '"') <++ fmap (const '"') (string "\"\"")
    complete ma = do
        result <- ma
        peek <- look
        guard $ null peek || head peek `elem` ",\r\n"
        return result
    newline = string "\n" <++ string "\r\n" <++ string "\r"
readCSV :: FilePath -> IO [[String]]
readCSV = fmap parseCSV . readFile
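And a small smoke test of the result, on an input without a trailing newline, might look like:

*Main> parseCSV "name,value\nfoo,1\nbar,2"
[["name","value"],["foo","1"],["bar","2"]]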
And life is kind of good. We have a parser, one that withstands some fairly devious tests. We know that there's an issue with trivial records, which we'll note as a comment on parseCSV:
{-
The parseCSV function parses a String containing comma separated
values (CSV) as a list of records, each of which is a list of fields
of type String. Note that CSV is ambiguous, in that a newline may be
intended either as a record terminator, or as a record separator. The
parseCSV function treats newlines as record separators, which allows
it to successfully parse in either case, albeit with a spurious final
record consisting of a single, empty field, on a String in which
newline was intended as a record terminator.
Client code should be aware of the possibility (even likelihood) of
encountering records that consist of a single, empty field.
-}
But we are perhaps to be forgiven if there's a bit of an empty feeling too. We can wrap this all up in a module, but that module is somehow a second-class citizen compared to the modules that came with the Haskell Platform or that we subsequently installed via cabal. We have to copy the source code for this module into every directory that holds Haskell source that wants to parse CSV. Moreover, rather than having the nice HTML documentation of the built-in modules, we're going to have to look at the source code to find the documentation and its warnings about possible trivial records. And if we find a bug later? Are we really going to find all of the directories of projects that relied on this code, and replace their CSV.hs files? This is not making things easy.
Let's find a still more excellent way. It is not that difficult to turn our project into a cabal project, and to take advantage of cabal's ability to create documentation and to integrate it into our Haskell environment.
Cabal Projects and Haddock
There is a good overview of this process on the HaskellWiki at haskell.org, How to write a Haskell program. The name notwithstanding, this wiki page describes how to package and distribute a Haskell program or library.
We start by creating a source directory, and running
$ cabal init
This is going to ask us a lot of questions, and it's best to be prepared with some of the answers first. A good first question is, “Where do we want our module to fit in the Haskell Platform module hierarchy?” I opted for Text.CSV, so it would be together with other parsers. So my source directory is named Text.CSV, and that's where I ran cabal init. Along the way, I needed to provide a package name (which by convention is hyphen-separated lower-case, so I chose text-csv), along with my name, email address, license, etc. The “How to write” page notes that for code to be accepted into the Platform, it needs to be given a BSD license, so I go with that, and cabal creates the appropriate LICENSE file, along with Text.CSV.cabal and Setup.hs. Both get a quick proofreading, with a few minor tweaks and additions to Text.CSV.cabal based on the Cabal User Guide.
Next, I need to set up my source inside the package directory. To that end, I create a Text subdirectory, and move my source file into Text/CSV.hs, adding the appropriate hierarchical module declaration:
module Text.CSV where
This follows a mandatory convention for source directories of projects that include hierarchical modules. Next up, I tweak my comments, adding Haddock syntax where appropriate. There's good documentation available in the Haddock User Guide.
I tend to write contract-oriented comments that precede code, i.e., my comments focus on what the function does (its contract), and sometimes on the why if that's not completely clear, perhaps with an example or three. I prefer to let the code itself document the how, perhaps with the mention of (better still, a link to) a publication if the code is the realization of a complex algorithm. To me, obscure code is more often a reason to re-write code than a reason to enshrine obscurity via a comment. This style of commenting naturally lends itself to creating documentation, and I only need two little tweaks to the comments I'd write anyway to get great results. First, I add a space and a vertical bar at the beginning of each comment block that I want to include in the HTML documentation: {- |. Then, I enclose the mention of any functions, types, etc., within single quotes, which instructs Haddock to automagically generate the links to their documentation. So the comment above becomes
{- |
The 'parseCSV' function parses a 'String' containing comma separated
values (CSV) as a list of records, each of which is a list of fields
of type 'String'. Note that CSV is ambiguous, in that a newline may be
intended either as a record terminator, or as a record separator. The
'parseCSV' function treats newlines as record separators, which allows
it to successfully parse in either case, albeit with a spurious final
record consisting of a single, empty field, on a 'String' in which
newline was intended as a record terminator.
Client code should be aware of the possibility (even likelihood) of
encountering records that consist of a single, empty field.
-}
The “How to write” page recommends bringing the project directory under source control using either git or darcs. I've learned enough source control systems for this lifetime, so I stick with one I've used before and choose git, establishing a repository on github.com. It also recommends that I run hlint on my source files, and hlint finds three minor improvements to the source code, which I quickly adopted. For now, I stick with my informal testing regime, rather than using QuickCheck. All that's left is:
$ cabal configure
$ cabal build
$ cabal install
And then... If I look at the local module documentation page, there's Text.CSV, with my comments beautifully formatted. And when I set up a little test directory outside of the project directory, import Text.CSV, and print the result of calling parseCSV on a little string, it works!!
Ahhhhh...... My code is now a first-class citizen, and so am I.
A few quick recommendations:
- If you make changes, increment the fourth-level version number, or cabal will complain. It's good practice anyway, as the whole point of version numbers is to correspond one-to-one with source code versions. If your changes involve either adding or deleting functionality, you'll need to increment one of the other parts of the version number, and should read Package versioning policy on the HaskellWiki to find out which. Cabal depends on this policy to configure reliable builds.
- Before synchronizing your package directory with the repository, it's a good idea to do a cabal clean first, as this will remove all of the dist subdirectories and the build products they contain. I find it very convenient to use GitHub.app, a GUI program under MacOS X, to manage synchronizing between the local package directory and github.com, which almost completely obviates the need for using git via the command line.
Oh, and my CSV code? It's on GitHub. Life is good.
Becoming an Expert
There's a lot here in this last section: cabal, haddock, hlint, git/darcs, repositories. It's easy to get overwhelmed, but don't. Each is a little piece in a bigger puzzle, and each can be tamed enough to get started in an hour or less. And the rewards... It's a short step from doing what I've shown you here to making your code available for distribution via Hackage, with publicity on the HaskellWiki and via the #Haskell IRC channel, and from there to the Haskell star you're becoming.