Lecture 19: Practical Functional Parsing

Having gone to all of the trouble of writing our own parser combinators, we'll now set that aside and use a similar but much more developed and efficient combinator library that comes with the Haskell Platform: Text.ParserCombinators.ReadP. Note that there are several other parser libraries for Haskell, and that ReadP isn't necessarily the most popular; it's just the best pedagogical fit for us right now. We can pretty much use ReadP as a plug-in replacement for Parser, but there are a few things we should change to make our use of it more efficient.

For example, if we assume

runParser :: ReadP a -> ReadS a
runParser = readP_to_S

token :: String -> a -> ReadP a
token str a = fmap (const a) (string str)

we can contrast ordinary with left-biased alternation by

> runParser (token "a" 'a' +++ token "a" 'b') "a"
[('a',""),('b',"")]

> runParser (token "a" 'a' <++ token "a" 'b') "a"
[('a',"")]

Likewise, we can illustrate the greedy nature of munch, as compared with many, by contrasting

> runParser (R.many . satisfy $ isAlpha) "abc123"
[("","abc123"),("a","bc123"),("ab","c123"),("abc","123")]

with

> runParser (munch isAlpha) "abc123"
[("abc","123")]

Note that we used import Text.ParserCombinators.ReadP as R (and import Control.Applicative as A) to allow us to work around the conflict over many, and also that munch and many, while often thought of together, take very different types of arguments (a character predicate for munch, an arbitrary parser for many).
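For concreteness, here is the import setup I'm assuming for the GHCi session above (a minimal sketch; your arrangement may differ):

import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP as R   -- munch, satisfy, (+++), (<++), and R.many
import Control.Applicative as A            -- A.many, which would otherwise clash with ReadP's many

With both modules imported under aliases, the bare name many is ambiguous, but R.many and A.many are not, which is exactly the workaround used in the examples above.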

*Exercise 19.1 Extend Parser.hs to include (<++).

*Exercise 19.2 Consider the following type declaration:

data ComplexInt = ComplexInt Int Int deriving (Show)

Use Text.ParserCombinators.ReadP to implement Read ComplexInt, where you can accept either the syntax "12" for ComplexInt 12 0 or "(1,-2)" for ComplexInt 1 (-2). This time, as the example indicates, I also want you to worry about the possibility of a leading minus sign before an integer. Hint: consider using the option parser combinator from Text.ParserCombinators.ReadP.
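This isn't a solution to the exercise, but to make the hint concrete, here is a minimal sketch of how option can handle an optional leading minus sign. The helper name signedInt and its details are my own, not part of the exercise:

import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- A hypothetical helper: parse an optional '-' (defaulting to ""), then one
-- or more digits, and read the whole thing as an Int.
signedInt :: ReadP Int
signedInt = do
    sign   <- option "" (string "-")
    digits <- munch1 isDigit
    return (read (sign ++ digits))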

Becoming an Expert

The module Text.ParserCombinators.ReadP contains many functions beyond those that we implemented in our Parser module. A good strategy for building expertise in Haskell is to read through the documentation for modules like Text.ParserCombinators.ReadP, and try to implement the various functions it contains. Then, follow the links to the source code, and see how Haskell experts have implemented them. This will give you practice, the opportunity to read and learn from experts, and also a close acquaintance with the facilities the module provides.

Let's consider a parsing task that seems simple at first, but which is really a good deal more difficult than it might appear: parsing comma separated values (CSV).

The format is deceptively simple, and parsing any individual CSV file is usually straightforward. A CSV file consists of a sequence of newline terminated records, where each record is a sequence of comma separated fields, and each field is just a String. So, what's hard? CSV began life as an informal format. There is an Internet Engineering Task Force (IETF) Request for Comments (RFC, i.e., an internet standards document), RFC 4180, that describes CSV, but that “standard” is based on reverse engineering files that claimed to be CSV, so the cart of practice came before the horse of specification. And my experience of CSV includes files that don't meet the “standard” of RFC 4180, a very real caveat for anyone inclined to emptor it. So writing a good, general CSV parser has real challenges.

There are (at least) three practical issues that intervene here.

For the sake of the discussion below, we'll say that a CSV record is trivial if it consists of a single, empty field. Trivial records are going to be the bane of our efforts to produce a truly general-purpose CSV parser, because ambiguities in interpretation often boil down to the question of whether a trivial record was intended at some point in the record sequence. It's tempting to just suppress all trivial records, thereby eliminating the intrinsic ambiguity in CSV, but it isn't really possible to rule out the possibility that the CSV author intended to include trivial records, e.g., to use a single CSV file to represent several distinct relations, segmented by trivial records.

With all this on our minds, let's get started. It's easiest to build the parser from the bottom up. First, let's deal with that pesky newline issue:

import Text.ParserCombinators.ReadP

newline :: ReadP String
newline = string "\n" <++ string "\r\n" <++ string "\r"

This is very intentional code. It is essential that we put the string "\r\n" before the string "\r", and that we use the short-circuiting (<++) between them, otherwise we'll recognize the CRLF sequence as consisting of two newlines, which would cause us to conjure into being a trivial record between the two characters of what's supposed to be an indivisible sequence representing a newline. And while there exists a tenuous theoretical possibility that a file may have been edited by hand first on a classic Mac OS system, and then subsequently on a Unix system, in such a way that a trivial record actually was intended, it's hard to imagine this actually happening in practice. After all, the people who edit/produce CSV have to live with themselves and their files. The far, far, far more likely explanation for a CRLF pair is that the CSV file was produced on a Windows system, and that it is intended to represent a single newline.
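To see the ordering issue concretely, here's a quick GHCi check (my own toy example, with "x" as filler input) contrasting the wrong order with the right one:

> readP_to_S (string "\r" <++ string "\r\n") "\r\nx"
[("\r","\nx")]
> readP_to_S (string "\r\n" <++ string "\r") "\r\nx"
[("\r\n","x")]

With "\r" first, the CR is consumed on its own and the LF is left behind to be (mis)read as a second newline; with "\r\n" first, the CRLF pair is consumed as a single newline.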

Note, too, that we're combining parsers using ReadP's left-biased sum (<++), in which a successful parse by the left-hand summand precludes any attempt to parse using the right-hand summand. This short-circuiting asymmetry creates a performance incentive for putting the parser that's more likely to succeed first. To put this slightly differently, we pay a performance penalty for putting a parser that's going to fail before a parser that's going to succeed, because if the successful parser runs first, (<++) won't bother to waste any computational effort in making the failing parser fail. I run on Unix systems, so I want the Unix-favoring alternative first. It's good to be the maintainer.

Next, we'll deal with fields. First, simple, unquoted fields:

simpleField :: ReadP String
simpleField = munch (`notElem` ",\"\n\r")

The parser combinator munch is greedy, i.e., it will read as many characters as possible that satisfy the predicate, and so it can be significantly more efficient to use munch p than many $ satisfy p. In this case, we're going to read all of the characters that are permitted in a simple field, i.e., everything but a comma, a double-quote, an LF, or a CR. A quoted field must begin and end with quotes.

Next up, quoted fields...

nextChar :: ReadP Char
nextChar = satisfy (/= '"') <++ fmap (const '"') (string "\"\"")

quotedField :: ReadP String
quotedField = do
    char '"'
    result <- many nextChar
    char '"'
    return result

The nextChar parser returns the next character to be parsed, collapsing doubled double quotes into a single double quote. The quotedField function reads a double quote, a run of characters produced by nextChar, and then a terminal double quote. Now it happens that this particular pattern—read something, read and remember something, read something, then return the remembered thing—happens a lot in parsing (think parenthesized expressions). So ReadP has a function between :: ReadP open -> ReadP close -> ReadP a -> ReadP a that captures this pattern, allowing us to simplify the definition of quotedField to

quotedField :: ReadP String
quotedField = between (char '"') (char '"') (many nextChar)
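If you're curious how a combinator like between might be written, here is a minimal sketch of one plausible definition (the library's actual implementation may differ):

import Text.ParserCombinators.ReadP

-- A between-like combinator: parse and discard the opening bracket, parse and
-- remember the payload, parse and discard the closing bracket.
between' :: ReadP open -> ReadP close -> ReadP a -> ReadP a
between' open close p = do
    _ <- open
    result <- p
    _ <- close
    return result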

We can now combine these two specialized field parsers into a general field parser:

field :: ReadP String
field = simpleField <++ quotedField

With any luck, we'll do enough testing to ensure coverage of the code we've written, because there's a bad bug lurking here:

*Main> readP_to_S field "foobar foobar foobar"
[("foobar foobar foobar","")]
*Main> readP_to_S field "\"foobar foobar foobar\""
[("","\"foobar foobar foobar\"")]

Ouch! We weren't expecting that! The second example, consisting of a quoted field, got parsed as an empty field, and we somehow fail to recognize the presence of a quoted field. Why? It's because the parser simpleField succeeded, despite our expectations, returning a successful parse of the empty text field after consuming no characters. And if we look at the code for simpleField, it's easy to see that it can do just that. How can we fix this? A reasonable first question is to ask whether or not it makes sense for a simple field to be empty. My experience with CSV is that yes, indeed, this often happens, e.g., if you write out a region of a spreadsheet as CSV, blank cells will be written out as empty fields. But we can use the information available to us a bit more thoroughly. After we've read a field (and it doesn't matter whether it's a simple field or a quoted field), we should either be at the end of the file, or the next character should be either a comma or the beginning of a newline. So what we'll do is to write a little wrapper:

import Control.Monad

complete :: ReadP a -> ReadP a
complete p = do
    result <- p
    peek <- look
    guard $ null peek || head peek `elem` ",\r\n"
    return result

field :: ReadP String
field = complete simpleField <++ complete quotedField
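Assuming the new definitions are loaded, re-running the earlier tests shows the fix doing its job:

*Main> readP_to_S field "foobar foobar foobar"
[("foobar foobar foobar","")]
*Main> readP_to_S field "\"foobar foobar foobar\""
[("foobar foobar foobar","")]

The quoted input now parses as the text of the field with the surrounding quotes stripped, and the degenerate empty parse of simpleField no longer survives.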

The look function returns the entire unparsed input; it's the moral equivalent of State's get. I've seen many students over the years misuse look (e.g., by using it to choose which parser to call next), but rather than ignoring it and hoping you won't notice it (which hasn't worked for me yet), let me take this opportunity to show you a proper use. Here, we're using look to make sure that after reading a field (whether simple or quoted), we're either at the end of the file, at the end of the record, or at the end of the field. The effect is to make a parser fail earlier on an unproductive alternative, which is a good thing.
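Here's a tiny illustration of look's behavior (my own toy example; get is ReadP's primitive that consumes and returns a single character):

*Main> readP_to_S (do { c <- get; rest <- look; return (c, rest) }) "abc"
[(('a',"bc"),"bc")]

Note that the "bc" returned by look also remains as the unconsumed input, confirming that look doesn't consume anything.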

With this, it's natural to use the description of a record (a sequence of fields separated by commas) to write the record parser:

record :: ReadP [String]
record = field `sepBy` char ','

Here, I have the advantage of a little experience. There are two versions of sepBy: the ordinary, unadorned sepBy, and a variant sepBy1. The distinction is that sepBy1 will never return the empty list. Is it reasonable to have an empty record? I suppose, but it would have to be indicated by an empty line. But that already represents a record that consists of a single, empty field, and [] and [""] are distinct. So if we allow record to return an empty record, we know it's also going to return a trivial record on the same input, and so there will be an ambiguity, which will prevent a successful parse. So we'll use sepBy1:

record :: ReadP [String]
record = field `sepBy1` char ','
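The distinction is easy to see on empty input (a quick GHCi check with a throwaway parser):

*Main> readP_to_S (char 'x' `sepBy` char ',') ""
[("","")]
*Main> readP_to_S (char 'x' `sepBy1` char ',') ""
[]

sepBy happily returns an empty list of items (shown here as the empty String, since the items are Chars) without consuming anything, while sepBy1 simply fails.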

Whew. Are we done with the thinking parts yet? Unfortunately, no. Remember that there's that nagging ambiguity in usage as to whether a CSV file ends with a newline or not? This is essentially the question of whether newline is a record separator or a record terminator. We like to think that it's a terminator, because then an empty file represents no records, and it might be nice to be able to represent the empty relation. Whereas, if newline is a separator, then an empty file must represent a relation containing a single trivial record, [[""]]. So it would be nice to be able to dictate, “terminator.” But we know that both kinds of files exist on the internet, so let's think through the consequences of using a different convention for newlines than the CSV author intended. If we decide that newlines are record terminators, and we get a file that ends with a non-trivial record, then that file will not parse. That's bad, really bad, if we're trying to be general purpose. Whereas, if we go against our better judgment, and interpret newlines as record separators, then if we encounter a file that ends with a newline, we'll end up adding a trivial record to the end. That's not great, but not great beats really bad.

OK, we think, why not just elide a terminal trivial record? Can we ever think of a context in which a terminal trivial record might be meaningful? Sadly, I can. That case where trivial records are relation separators might apply here, in which case a terminal trivial record might indicate the presence of a final, empty, relation. Edge cases are often like this, seeming artificial until there turns out to be no other way to do what you want to do.

So we need to make an engineering decision. My choice here is to go against my dictatorial preference, and interpret newline as a separator, but to reserve a mental note that this has to be documented. Not great beats really bad. The same sort of considerations that were in play with record apply here too, so...

csv :: ReadP [[String]]
csv = record `sepBy1` newline

Now we have a parser. But we want to deliver functions

parseCSV :: String -> [[String]]
readCSV :: FilePath -> IO [[String]]

So we still have some work to do. First, we'll adapt our parseWith function from the last Lecture to handle ReadP based parsers, which requires only that we replace runParser with readP_to_S:

import Data.Char (isSpace)

parseWith :: ReadP a -> String -> a
parseWith p s = case [a | (a,t) <- readP_to_S p s, all isSpace t] of
    [a] -> a
    []  -> error "no parse"
    _   -> error "ambiguous parse"

And then we can write parseCSV and readCSV encapsulating the various parsers we wrote along the way.

parseCSV :: String -> [[String]]
parseCSV = parseWith csv where
    csv = record `sepBy1` newline
    record = field `sepBy1` char ','
    field = complete simpleField <++ complete quotedField
    simpleField = munch (`notElem` ",\"\n\r")
    quotedField = between (char '"') (char '"') (many nextChar)
    nextChar = satisfy (/= '"') <++ fmap (const '"') (string "\"\"")
    complete ma = do
        result <- ma
        peek <- look
        guard $ null peek || head peek `elem` ",\r\n"
        return result
    newline = string "\n" <++ string "\r\n" <++ string "\r"

readCSV :: FilePath -> IO [[String]]
readCSV = fmap parseCSV . readFile
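Assuming the definitions above are loaded into GHCi, a couple of quick sanity checks (my own small test inputs):

*Main> parseCSV "a,b\r\nc,d"
[["a","b"],["c","d"]]
*Main> parseCSV "\"a,b\",c\nd,e"
[["a,b","c"],["d","e"]]

The CRLF is read as a single record separator, and the quoted first field of the second example keeps its embedded comma.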

And life is kind of good. We have a parser, one that withstands some fairly devious tests. We know that there's an issue with trivial records, which we'll note as a comment on parseCSV:

{- The parseCSV function parses a String containing comma separated values
   (CSV) as a list of records, each of which is a list of fields of type
   String. Note that CSV is ambiguous, in that a newline may be intended
   either as a record terminator, or as a record separator. The parseCSV
   function treats newlines as record separators, which allows it to
   successfully parse in either case, albeit with a spurious final record
   consisting of a single, empty field, on a String in which newline was
   intended as a record terminator. Client code should be aware of the
   possibility (even likelihood) of encountering records that consist of a
   single, empty field.
-}

But we are perhaps to be forgiven if there's a bit of an empty feeling too. We can wrap this all up in a module, but that module is somehow a second-class citizen compared to the modules that came with the Haskell Platform or that we subsequently installed via cabal. We have to copy the source code for this module into every directory that holds Haskell source that wants to parse CSV. Moreover, rather than having the nice HTML documentation of the built-in modules, we're going to have to look at the source code to find the documentation and its warnings about possible trivial records. And if we find a bug later? Are we really going to find all of the directories of projects that relied on this code, and replace their CSV.hs files? This is not making things easy.

Let's find a still more excellent way. It is not that difficult to turn our project into a cabal project, and to take advantage of cabal's ability to create documentation and to integrate it into our Haskell environment.

Cabal Projects and Haddock

There is a good overview of this process on the HaskellWiki at haskell.org, How to write a Haskell program. The name notwithstanding, this wiki page describes how to package and distribute a Haskell program or library.

We start by creating a source directory, and running

$ cabal init

This is going to ask us a lot of questions, and it's best to be prepared with some of the answers first. A good first question is, “Where do we want our module to fit in the Haskell Platform module hierarchy?” I opted for Text.CSV, so it would be together with other parsers. So my source directory is named Text.CSV, and that's where I ran cabal init. Along the way, I needed to provide a package name (which by convention is hyphen-separated lower-case, so I chose text-csv), along with my name, email address, license, etc. The “How to write” page notes that for code to be accepted into the Platform, it needs to be given a BSD license, so I go with that, and cabal creates the appropriate LICENSE file, along with Text.CSV.cabal and Setup.hs. Both get a quick proofreading, with a few minor tweaks and additions to Text.CSV.cabal based on the Cabal User Guide.

Next, I need to set up my source inside the package directory. To that end, I create a Text subdirectory, and move my source file into Text/CSV.hs, adding the appropriate hierarchical module declaration:

module Text.CSV where

This follows a mandatory convention for source directories of projects that include hierarchical modules. Next up, I tweak my comments, adding Haddock syntax where appropriate. There's good documentation available in the Haddock User Guide.

I tend to write contract-oriented comments that precede code, i.e., my comments focus on what the function does (its contract), and sometimes on the why if that's not completely clear, perhaps with an example or three. I prefer to let the code itself document the how, perhaps with the mention of (better still, a link to) a publication if the code is the realization of a complex algorithm. To me, obscure code is more often a reason to re-write code than a reason to enshrine obscurity via a comment. This style of commenting naturally lends itself to creating documentation, and I only need two little tweaks to the comments I'd write anyway to get great results. First, I add a space and a vertical bar at the beginning of each comment block that I want to include in the HTML documentation: {- |. Second, I enclose the mention of any functions, types, etc., within single quotes, which instructs Haddock to automagically generate the links to their documentation. So the comment above becomes

{- | The 'parseCSV' function parses a 'String' containing comma separated
     values (CSV) as a list of records, each of which is a list of fields of
     type 'String'. Note that CSV is ambiguous, in that a newline may be
     intended either as a record terminator, or as a record separator. The
     'parseCSV' function treats newlines as record separators, which allows
     it to successfully parse in either case, albeit with a spurious final
     record consisting of a single, empty field, on a 'String' in which
     newline was intended as a record terminator. Client code should be aware
     of the possibility (even likelihood) of encountering records that
     consist of a single, empty field.
-}

The “How to write” page recommends bringing the project directory under source control using either git or darcs. I've learned enough source control systems for this lifetime, so I stick with one I've used before and choose git, establishing a repository on github.com. The page also recommends that I run hlint on my source files; hlint finds three minor improvements to the source code, which I quickly adopt. For now, I stick with my informal testing regime, rather than using QuickCheck. All that's left is:

$ cabal configure
$ cabal build
$ cabal install

And then... If I look at the local module documentation page, there's Text.CSV, with my comments beautifully formatted. And when I set up a little test directory outside of the project directory, import Text.CSV, and print the result of calling parseCSV on a little string, it works!!

Ahhhhh...... My code is now a first-class citizen, and so am I.

A few quick recommendations:

Oh, and my CSV code? It's on GitHub. Life is good.

Becoming an Expert

There's a lot here in this last section: cabal, haddock, hlint, git/darcs, repositories. It's easy to get overwhelmed, but don't be. Each is a little piece in a bigger puzzle, and each can be tamed enough to get started in an hour or less. And the rewards... It's a short step from doing what I've shown you here to making your code available for distribution via Hackage, with publicity on the HaskellWiki and via the #Haskell IRC channel, and from there to the Haskell star you're becoming.