Lecture 18: Practical Functional Parsing
Text.ParserCombinators.ReadP
Having gone to all of the trouble of writing our own parser combinator library, we'll now set it aside and use a similar but much more developed and efficient combinator library that comes with the Haskell Platform: Text.ParserCombinators.ReadP. Note that there are several other parser libraries for Haskell, and that ReadP isn't necessarily the most popular; it's just the best pedagogical fit for us right now.
First, the good news. Many familiar parser combinators from Parser are available to us in ReadP, notably satisfy, char, and string, as well as many other conveniences. ReadP also belongs to all of the type classes that you'd expect: Functor, Applicative, Alternative, Monad, MonadFail, and MonadPlus.
The bad news is that when ReadP was written, functional parsing was usually referred to as monadic parsing, and many of the other type classes hadn't yet been formalized. So there is an occasional awkwardness, in that ReadP in some ways anticipated Applicative, Alternative, and other type classes that didn't exist when it was written, and so in some ways duplicates them. In particular, ReadP defines many with the same meaning as Alternative's many, but conflicting with it; and many1, which is synonymous with Alternative's some. ReadP defines (+++), which is synonymous with (<|>), while pfail is synonymous with empty.
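In practice, the many conflict bites as soon as a module imports both Text.ParserCombinators.ReadP and Control.Applicative unqualified. Here is one way to resolve it, as a sketch (hiding in the other direction works equally well):

-- Prefer the Alternative method many, which works for ReadP too,
-- and hide ReadP's own many to avoid the ambiguity.
import Text.ParserCombinators.ReadP hiding (many)
import Control.Applicative (many)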
A mildly annoying difference is that the “run” function is readP_to_S rather than runParser. If it's sufficiently annoying, just define runParser = readP_to_S; otherwise, live with it. We'll live with it.
ReadP was written with a concern for efficiency, and this leads us to consider a couple of new parser combinators.
(<++) :: ReadP a -> ReadP a -> ReadP a
is described as a left-biased alternative. The idea is that pa <++ pb is a parser that tries pa first. If it returns any results, those are the results; but if it fails, it next tries pb. We can imagine that (<++) is defined as
ap <++ bp = Parser $ \s ->
    case runParser ap s of
        [] -> runParser bp s
        rs -> rs
although the actual definition is quite different, as the underlying representations are more complex.
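For example, compare (<++) with the symmetric choice (+++) in GHCi (results shown up to order):

> readP_to_S (string "a" <++ string "ab") "abc"
[("a","bc")]
> readP_to_S (string "a" +++ string "ab") "abc"
[("a","bc"),("ab","c")]

Next, we have: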
munch :: (Char -> Bool) -> ReadP String
This function parses the longest prefix of the string to be parsed, all of whose characters satisfy the predicate. This is subtly different from many (satisfy p), cf.
> readP_to_S (munch (=='a')) "aaabb"
[("aaa","bb")]
> readP_to_S (many $ satisfy (=='a')) "aaabb"
[("","aaabb"),("a","aabb"),("aa","abb"),("aaa","bb")]
This often doesn't make any difference at all (as only the longest version can be valid), but there can be a quadratic savings in not generating the shorter forms.
Note that there's a variant munch1 of munch, which succeeds only if there is at least one character in the result.
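A quick GHCi check makes the difference vivid: munch succeeds on an empty match, while munch1 fails outright:

> readP_to_S (munch (== 'a')) "bbb"
[("","bbb")]
> readP_to_S (munch1 (== 'a')) "bbb"
[]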
Becoming an Expert
The module Text.ParserCombinators.ReadP contains many functions beyond those that we implemented in our Parser module. A good strategy for building expertise in Haskell is to read through the documentation for modules like Text.ParserCombinators.ReadP, and try to implement the various functions it contains. Then, follow the links to the source code, and see how Haskell experts have implemented them. This will give you practice, the opportunity to read and learn from experts, and also a close acquaintance with the facilities the module provides.
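For instance, two of ReadP's conveniences, between and option, can be written in a line or two from combinators we've already seen. Here is a sketch; the library's actual definitions are equivalent in spirit, though they may differ in detail:

-- Parse p bracketed by open and close, keeping only p's result.
between :: ReadP open -> ReadP close -> ReadP a -> ReadP a
between open close p = do
    open
    x <- p
    close
    pure x

-- Either parse p, or consume nothing and return the default x.
option :: a -> ReadP a -> ReadP a
option x p = p +++ pure x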
Example: Duplication With Variation
It's easy to duplicate a file; that's what the Unix cp utility is for. But what if we want to produce a bunch of near-duplicates of a given text, i.e., we want those duplicates to vary somehow? If the variations are sufficiently simple, we can write a program that generates all of the variations, and produces the duplicated text. But often, we'll have some sort of database, and the variations will amount to the rows of one of the relations of that database. In this case, we'll want to be able to handle a file that contains our data, and CSV (comma separated values) with headers is a natural choice.
If we're only interested in duplicating a particular fixed base text, we incorporate that text into the logic of the program we're writing. But this is the sort of job where the text to be duplicated tends to vary over time too, and this leads to a preference to “move it out of the code.” So we'll create a simple text format to describe the underlying text.
As an aside, the code we're studying is a simplified version of the spam program that Professor Kurtz uses to send out wait-list notifications and similar bulk emailings.
DupV.Template
Our templates will consist of ordinary text files, in which set-braces are used to indicate a placeholder to be filled in from our data file. For example, one of the test files for this program contains:
The Arabic numeral {arabic} and the Roman numeral {roman} both represent {english}.
The idea here is that {arabic}, {roman}, and {english} will all be filled in with data from our CSV file.
Our internal representation of a Template will be:
data Template = Template { items :: [TemplateItem] }

data TemplateItem
    = Literal String
    | Variable String
Our first programming task is to parse the input file, e.g., obtaining
Template {items =
    [ Literal "The Arabic numeral "
    , Variable "arabic"
    , Literal " and the Roman numeral "
    , Variable "roman"
    , Literal " both represent "
    , Variable "english"
    , Literal ".\n"
    ]}
We'll use ReadP:
parseTemplate :: ReadP Template
parseTemplate = Template <$> many parseTemplateItem where
    parseTemplateItem = parseLiteral <++ parseVariable
    parseVariable = Variable <$>
        (char '{' *> munch1 isAlphaNum <* char '}')
    parseLiteral = Literal <$> munch1 (`notElem` "{}")
This requires a bit of explanation. We use many to parse a list of items. The items to be parsed are TemplateItems, which come in one of two forms. We write a sub-parser for each, using (<++) to avoid a parse that must fail if the preceding parse succeeded. There's some subtlety in both subparsers.
Our use of munch1 in parseLiteral is important! If we just used munch, the parser would succeed in producing a Literal "" without reading any input, and in the context of many would result in infinitely many Literal ""'s at the end of the string (as well as at each transition to a variable). Pragmatists will note that parsers should always consume some input; otherwise, bad things can happen!
Our definition of parseVariable uses two new applicative operators, <* and *>. These are sequencing operators, which return the value of the first or second argument, respectively. Note that each operator's angle bracket “points towards” the applicative value we will keep. In the old days, we'd have written parseVariable in a monadic style, which is superficially quite different, but essentially the same:
parseVariable = Variable <$> do
    char '{'
    result <- munch1 isAlphaNum
    char '}'
    pure result
It is perfectly reasonable to write parsers this way when you're trying to figure things out, but it's useful to keep in mind that if a bound variable never appears on the right-hand side of a subsequent binding, it should be possible to rewrite the code applicatively. With practice, the applicative forms come first.
We can use the ReadP parser to make Template an instance of the Read type class,
instance Read Template where
    readsPrec _ = readP_to_S parseTemplate
and then introduce the convenience function
loadTemplateFile :: FilePath -> IO Template
loadTemplateFile path = read <$> readFile path
Finally, we have a simple bit of code for instantiating a template, given a list of keys and a list of values:
instantiate :: Template -> [String] -> [String] -> String
instantiate template header record =
    let dict = Map.fromList $ zip header record
        fill (Literal s) = s
        fill (Variable v) = dict ! v
    in concat . map fill . items $ template
This is simple, yet powerful code. We build a Map, an efficient structure for manipulating key-value pairs, out of an association list built from the keys and values. We define a fill :: TemplateItem -> String that “evaluates” a TemplateItem in the context of the Map we just built. Finally, we map the fill function across the items of the template, obtaining a list of Strings, which we flatten using concat.
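For example, given the template file shown earlier (here under the hypothetical name roman.tmpl) and one row of matching data, a GHCi session might look like this:

> tmpl <- loadTemplateFile "roman.tmpl"
> putStr $ instantiate tmpl ["arabic","roman","english"] ["4","IV","four"]
The Arabic numeral 4 and the Roman numeral IV both represent four.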
DupV.SimpleCSV
The CSV format is deceptively simple, and parsing any individual CSV file is usually straightforward. A CSV file consists of a sequence of newline-terminated records, where each record is a sequence of comma-separated fields, and each field is just a String. So, what's hard? CSV began life as an informal format. There is an Internet Engineering Task Force (IETF) Request for Comments (RFC, a.k.a. an internet standards document), RFC 4180, that describes CSV, but that “standard” is based on reverse engineering files that claimed to be CSV, so the cart of practice came before the horse of specification. And my experience of CSV includes files that don't meet the “standard” of RFC 4180, a very real caveat for anyone who emptors it. So writing a good, general CSV parser has real challenges.
We'll start by writing a simplified ReadP parser, and then deal with some of the complexities of CSV.
newtype CSV = CSV { content :: [[String]] }

instance Read CSV where
    readsPrec _ = readP_to_S parseSimpleCSV

parseSimpleCSV :: ReadP CSV
parseSimpleCSV = CSV <$> record `endBy` newline <* eof where
    newline = string "\n"
    record = field `sepBy` char ','
    field = munch (`notElem` ",\n")
There's a bit of complexity here.
As before, we're going to take advantage of the Read type class, and so use a newtype to wrap the type we're interested in. But read can ignore white space at the end of its input, and in the case of CSV parsing, this can result in an ambiguity, so we use the eof parser from ReadP to ensure that the entire input string is read.
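To see the kind of ambiguity that eof rules out, compare a field list parsed with and without it (a GHCi sketch using the same munch and sepBy as above; results shown up to order):

> readP_to_S (munch (/= ',') `sepBy` char ',') "a,b"
[([],"a,b"),(["a"],",b"),(["a","b"],"")]
> readP_to_S (munch (/= ',') `sepBy` char ',' <* eof) "a,b"
[(["a","b"],"")]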
The endBy parser combinator builds lists of values from a parser for the values and a parser for their terminators. There is a similar sepBy parser combinator for building lists of values from a parser for the values and a parser for the separators. The expression record `endBy` newline is a parser for a list of records, each of which must be terminated by a newline, while field `sepBy` char ',' is a parser for a list of fields separated by commas.
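Plausible definitions, close in spirit to the library's own, make the difference concrete:

-- One value, then zero or more (separator, value) pairs; or the empty list.
sepBy :: ReadP a -> ReadP sep -> ReadP [a]
sepBy p sep = ((:) <$> p <*> many (sep *> p)) +++ pure []

-- Zero or more values, each of which must be followed by its terminator.
endBy :: ReadP a -> ReadP sep -> ReadP [a]
endBy p sep = many (p <* sep)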
This would be good enough, if
- we were only going to run this code on Unix-like systems, and
- we never needed a comma or a newline within a field value.
Unfortunately, these are not assumptions we want to make. So there's a bit of work to do.
Newlines
Operating systems don't agree as to what precise sequence of characters constitutes a newline. Unix uses a bare linefeed (LF) "\n", a.k.a. ASCII 012 (that's an octal codepoint). MacOS Classic used a bare carriage return (CR) "\r", a.k.a., ASCII 015. MacOS X follows the Unix convention as befits its Unix foundations. Finally, Windows follows a convention that is as old as teletype machines, and relies on a CRLF pair, "\r\n". And CSV is such a basic file format that it could have come from anywhere, and may have passed through many hands, so we can't even be entirely confident that the same newline convention will be used consistently within a single file, although we expect that exceptions to this will be rare.
We'll take a pragmatic point of view. We may have to deal with Windows, Linux, and MacOS X, but we're not going to have to deal with MacOS Classic. So we can define
newline = string "\n" <++ string "\r\n"
Yes, this is biased towards Unix. I like it that way.
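A quick GHCi check of the left bias at work:

> readP_to_S (string "\n" <++ string "\r\n") "\nfoo"
[("\n","foo")]
> readP_to_S (string "\n" <++ string "\r\n") "\r\nfoo"
[("\r\n","foo")]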
Commas
To deal with the issue of commas and newlines within fields, CSV has the notion of a quoted field: this is a field that begins and ends with a '"', and can contain anything but a '"' within it. To avoid any ambiguity, CSV forbids simple (non-quoted) fields from containing a '"'. Thus
field = quotedField <++ simpleField
simpleField = munch (`notElem` ",\n\"")
quotedField = char '"' *> munch (/='"') <* char '"'
Of course, this just trades one problem for another. We can now have commas in our fields, but we can't have double-quotes. To deal with this, CSV further allows a quoted double-quote within a quoted field. How do you quote a double-quote? In CSV, you repeat it. Thus, e.g.,
"He said, ""Foo!"""
To accommodate this, we'll replace the munch (/='"') within the definition of quotedField as follows:
quotedField = char '"' *> many quoteChar <* char '"'
quoteChar = satisfy (/= '"') <++ (string "\"\"" $> '"')
Note the useful ($>) operator from Data.Functor. The expression (string "\"\"" $> '"') means that if we successfully parse a repeated double-quote, the value we should return is a single double-quote.
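One wrinkle worth seeing: many reintroduces spurious short parses, which is exactly what the exercise below addresses. A GHCi sketch (results shown up to order):

> readP_to_S quotedField "\"a\"\"b\""
[("a","\"b\""),("a\"b","")]

The first result treats the doubled quote as the end of one field and the start of leftover input; only the second consumes the whole quoted field.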
*Exercise 18.1 The use of many in the definition of quotedField introduces the very problem that munch was written to solve. The problem is that we can't use munch, as it has the wrong type (it acts on a character predicate, rather than on a general parser). Write the function
greedy :: ReadP a -> ReadP [a]
which greedily parses a sequence of values, returning only maximal sequences. Note that “maximal” in this context does not mean that the list has the greatest possible length, nor that it ingests the maximum number of input characters, but rather that it cannot be extended. E.g.,
> readP_to_S (greedy (string "a" +++ string "ab")) "abaac"
[(["a"],"baac"),(["ab","a","a"],"c")]
Hint: the (<++) parser combinator is very helpful!
Main
Our main is
main :: IO ()
main = do
    args <- getArgs
    case args of
        [variationFile,templateFile] -> do
            template <- loadTemplateFile templateFile
            header:variations <- content <$> loadSimpleCSV variationFile
            putStr . concat . (`map` variations) $ \variation ->
                instantiate template header variation
        _ -> do
            hPutStrLn stderr "Usage: spam variationFile templateFile"
            exitFailure
This is mostly straightforward.
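One small gap: main calls loadSimpleCSV, which isn't shown in this section. Presumably it mirrors loadTemplateFile; a sketch under that assumption:

-- Assumed analogue of loadTemplateFile, using the Read CSV instance above.
loadSimpleCSV :: FilePath -> IO CSV
loadSimpleCSV path = read <$> readFile path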
Real Programming
There is a good overview of this process on the HaskellWiki at haskell.org, How to write a Haskell program. The name notwithstanding, this wiki page describes how to package and distribute a Haskell program or library.
I recommend using git for version control, cabal for builds, and haddock for documentation.
Real programs often try to move as much of their data into data files as possible (we've done this), and as much of their internal logic as possible into modules organized around basic data structures and the algorithms that support them. This simplifies code reuse.
Real programs also require a means of distribution. For this particular program, the sources are on GitHub: stuartkurtz/DupV, and the program can easily be downloaded and installed from there.
Exercise 18.2 One way this program can be made much more useful is to add a number of pre-defined variables, e.g., the time and date of processing. Modify the dupv program to include such a feature.