Functional Parsing

We have implemented several functions for reading values:

read :: Read a => String -> a

Although a “parsing” function of the form…

type Parser a = String -> a

… works for simple types of data, often we’ll need to more incrementally, and non-locally, manipulate the input string in order to determine how it should be interpreted. More like a stateful function:

type Parser a = String -> (a, String)

We will furthermore often need to return multiple possible parses:

type Parser a = String -> [(a, String)]

And, as we’ve done many times before, create a new type…

data Parser a = Parser (String -> [(a, String)])

… without the run-time overhead…

newtype Parser a = Parser (String -> [(a, String)])

… and with record syntaxs to automatically generating a selector for retrieving the function inside:

newtype Parser a = Parser { runParser :: String -> [(a, String)] }

So, we’ve got a “combination” of State and List. We’ll talk more about this another day.

As you’ll see in a compilers course, a traditional way to write a parser is to factor it into into lexing (which breaks a string into a stream of tokens) and parsing (which consumes tokens according to a grammar for the language). Domain-specific languages for lexers and parsers are often used to generate code that integrates with the rest of the compiler implementation (type checking, evaluation or code generation, etc).

In contrast, the approach here builds and composes parsers using the building blocks of functional programming we have been studying.

Basic Parsers

Parsing Characters

char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  []     -> []
  (a:as) -> if c == a then [(a,as)] else []

> runParser (char 'C') "Cool"
> runParser (char 'C') "Not Cool"

Clean this up with a cute pattern called Fokker’s trick:

char c = Parser $ \s -> case s of
  []     -> []
  (a:as) -> [ (a,as) | c == a ]

We can make this work more generally:

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser $ \s -> case s of
    []     -> []
    (a:as) -> [ (a,as) | p a ]

char c = satisfy (c==)
alpha  = satisfy isAlpha
digit  = satisfy isDigit
space  = satisfy isSpace

> runParser alpha "Hello!"
> runParser alpha "!Hello!"

Parsing Strings

string :: String -> Parser String
string str = Parser $ \s ->
  let (pre, suf) = splitAt (length str) s in
  if str == pre then [(pre, suf)] else []

Again, use Fokker’s trick:

string str = Parser $ \s ->
  [ (pre, suf) | let (pre, suf) = splitAt (length str) s, str == pre ]
> runParser (string "Hello") "Hello!"
> runParser (string "Hello") " Hello!"

Parsing Tokens

A token is a constant string that may some meaning. For example, in Haskell, the token True refers to one of the data constructors of the Bool datatype, and the token let marks the beginning of a let-expression.

The token parser assigns meaning to a String token:

token :: String -> a -> Parser a
token str a = Parser $ \s0 ->
  [ (a, s0) | (_, s1) <- runParser (string str) s ]

true  = token "True" True
false = token "False" False
> runParser true "True"
> runParser true "False"

Basic Parser Combinators

Given some existing parsers, we might like to compose them. For example, sequentially…

hello :: Parser String
hello = Parser $ \s0 ->
  [ ("Hello", s5)
      | (a1,s1) <- runParser (char 'H') s0
      , (a2,s2) <- runParser (char 'e') s1
      , (a3,s3) <- runParser (char 'l') s2
      , (a4,s4) <- runParser (char 'l') s3
      , (a5,s5) <- runParser (char 'o') s4
      ]

… or in parallel:

bool :: Parser Bool
bool = Parser $ \s0 ->
  runParser true s0 ++ runParser false s0

The Nuclear Haskell FAMily

Let’s implement Monad first and then take the free instances of Functor and Applicative.

instance Functor     Parser where {fmap f x = pure f <*> x}
instance Applicative Parser where {pure = return; (<*>) = ap}

instance Monad Parser where
  return :: a -> Parser a
  return a = Parser $ \s0 -> [(a, s0)]

  (>>=) :: Parser a -> (a -> Parser b) -> Parser b
  pa >>= f = Parser $ \s0 ->
    [ (b, s2) | (a, s1) <- runParser pa s0
              , (b, s2) <- runParser (f a) s1 ]

Exercise: Implement the Functor and Applicative instances directly.

Example

We can now rewrite token more succinctly:

token str tok =
  -- fmap (\_ -> tok) (string str)
  -- fmap (const tok) (string str)
     const tok <$> string str

Example

hello :: Parser String
hello = do
  _ <- char 'H'
  _ <- char 'e'
  _ <- char 'l'
  _ <- char 'l'
  _ <- char 'o'
  pure "Hello"

Or, because each successive parser is “independent” of the previous, use the Applicative interface:

hello :: Parser String
hello =
  pure (\_ _ _ _ _ -> "Hello")
   <*> char 'H'
   <*> char 'e'
   <*> char 'l'
   <*> char 'l'
   <*> char 'o'

Alternative

The Alternative class describes types that — in addition to being applicative functors — have an identity element called empty, and a binary associative operator called (<|>):

class Applicative t => Alternative t where
  empty :: t a
  (<|>) :: t a -> t a -> t a

The Maybe instance is defined to return the first Just value, if any:

> foldr (<|>) empty [Nothing, Just 2, Just 23]
Just 2

There is a useful definition of Alternative for parsers:

instance Alternative Parser where
  empty :: Parser a
  empty = Parser $ \s -> []

  (<|>) :: Parser a -> Parser a -> Parser a
  (Parser f) <|> (Parser g) = Parser $ \s -> f s ++ g s

Examples

bool :: Parser Bool
bool = true <|> false
> runParser bool "TrueFalse"
> runParser bool "FalseTrue"

Control.Applicative defines several utility functions, including:

some     :: Alternative t => t a -> t [a]
many     :: Alternative t => t a -> t [a]
optional :: Alternative t => t a -> t (Maybe a)

We can use these to define parsers that recognize one-or-more, zero-or-more, or zero-or-one matches, respectively:

> runParser (some bool) "TrueTrue!"
> runParser (many bool) "TrueTrue!"
> runParser (optional bool) "TrueTrue!"

The definitions of some and many are a bit clever, using pure and (<|>). Try to implement them, and then take a peek at the library source code.

Additional Parser Combinators

To do in class and/or at home. (Note: Don’t worry about order of matches in the example interactions; list is modeling “non-determinism.”)

option :: a -> Parser a -> Parser a

(<++) :: Parser a -> Parser a -> Parser a

munch :: (Char -> Bool) -> Parser String

sepBy  :: Parser a -> Parser sep -> Parser [a]
sepBy1 :: Parser a -> Parser sep -> Parser [a]

between :: Parser open -> Parser close -> Parser a -> Parser a

skipSpaces :: Parser ()

Source Files

The Read Class

So far, we have understood this class as follows:

class Read a where
    read :: String -> a
    ...

But actually, read is defined as a library function outside of the class:

class Read a where
 -- readsPrec :: Int -> ReadS a
    readsPrec :: Int -> String -> [(a, String)]

    ...

The type ReadS is an alias for the “raw” parsing functions we have been working with. The Int argument pertains to precedence levels, which often arises in practice but which we will not consider here. So, given a parserT :: Parser T for some data type T, we can implement the following:

instance Read T where
    readsPrec _ = runParser parserT

Compared to read, here is a version that deals with errors explicitly rather than crashing:

readEither :: Read a => String -> Either String a
readEither s0 =
  case reads s0 of
    [(a,"")] -> Right a
    [(a,s1)] -> Left $ "suffix not consumed: [" ++ s1 ++ "]"
    []       -> Left $ "no successful parses"
    _        -> Left $ "multiple successful parses"

See readMaybe and readEither in Text.Read for similar functionality. (The function reads is equivalent to: readsPrec 0)

Example: Parsing Lists

Let’s write a simple parser for a simple type of lists.

data ListInt
  = Nil
  | Cons Int ListInt

instance Show ListInt where
  show Nil = "[]"
  show (Cons n ns) = show n ++ ":" ++ show ns

instance Read ListInt where
  readsPrec _ = runParser listInt

listInt :: Parser ListInt
listInt = nil <|> cons

Start with parsing empty list:

nil :: Parser ListInt
nil = do
  string "[]"
  pure Nil

Now parse cons:

int :: Parser Int
int = do
  n <- some $ satisfy isDigit
  pure $ read n

cons :: Parser ListInt
cons = do
  n <- int
  char ':'
  ns <- listInt
  pure $ Cons n ns

Okay, now let’s add sugared versions:

sweetList :: Parser ListInt
sweetList = do
  char '['
  ns <- int `sepBy` (char ',')
  char ']'
  pure $ foldr Cons Nil ns

Clean up sweetList with between:

sweetList = do
  ns <- between (char '[') (char ']') (sepBy int (char ','))
  pure $ foldr Cons Nil ns

This works on its own, but…

listInt = nil <|> cons <|> sweetList

>> read "[]" :: List
*** Exception: Prelude.read: ambiguous parse

Problem is sweetList might match [], but so does nil. So, could use sepBy1 instead:

ns <- between (char '[') (char ']') (sepBy1 int (char ','))

But, in any case, these different parsers accept disjoint sets of strings, so no need to run all.

listInt = nil <++ cons <++ sweetList

Monadic vs. Applicative Parsing

In our implementations of nil, int, cons, and sweetList above, we used (>>=) (via do-notation) to stitch together multiple parsers. But none of those implementations actually require the full power of the Monad interface (i.e. the ability of one parser to depend on the results of the previous).

nil, int, and sweetList can be written via the Functor interface:

nil = const Nil <$> string "[]"

int = read <$> (some $ satisfy isDigit)

sweetList =
  foldr Cons Nil <$>
    between (char '[') (char ']') (sepBy int (char ','))

cons cannot, because the pure function we want to apply, a wrapper around ConsList, takes one more than one argument. Enter Applicative:

cons =
  (\n _ ns -> Cons n ns)
    <$> int
    <*> char ':'
    <*> listInt

Parser Pipelines

Having to create a lambda just to ignore some of the results is clunky. Here is where the Applicative operator (<*) comes in handy.

(<*>) :: Applicative t => t (a -> b) -> t a -> t b
(<*)  :: Applicative t => t       b  -> t a -> t b

tb <* ta = pure (\b _ -> b) <*> tb <*> ta
(<*)     = liftA2 const

For example, rather than writing…

(\a1 _ a3 _ a5 -> e) <$> x1 <*> x2 <*> x3 <*> x4 <*> x5

… can write

(\a1 a3 a5 -> e) <$> x1 <* x2 <*> x3 <* x4 <*> x5

Notice that sprinkling in (<*>) “keeps” the result of the second parser whereas sprinkling in (<*) “ignores” the result.

May also choose to forgo (<$>) before the first argument so that there are only two operators to look at:

pure (\a1 a3 a5 -> e) <*> x1 <* x2 <*> x3 <* x4 <*> x5

Formatting this expression with newlines results in a nice “parser pipeline”:

pure (\a1 a3 a5 -> e)
  <*> x1
  <*  x2
  <*> x3
  <*  x4
  <*> x5

Now, back to cons:

cons :: ReadP List
cons =
  pure Cons
    <*> int
    <*  char ':'
    <*> listInt

And we could choose to write parser pipelines for all the rest, too, to create a similar feel (even though we don’t need Applicative for int and sweetList):

int :: ReadP Int
int =
  pure read
    <*> (some $ satisfy isDigit)

cons :: ReadP List
cons =
  pure Cons
    <*> int
    <*  char ':'
    <*> listInt

sweetList :: ReadP List
sweetList =
  pure (foldr Cons Nil)
    <*> between (char '[') (char ']') (sepBy int (char ','))

Source Files

Practical Parsing

When continuing with larger Haskell developments beyond this course, take note of the following library implementation of parser combinators.

ReadP Library

Though elegant, the functional parsing implementation — which maintains lists of potential results and performs backtracking — is not the most efficient. The Text.ParserCombinators.ReadP library offers a more efficient and full-featured implementation.

> :info ReadS    -- type alias for String -> [(a,String)]

> :info ReadP
newtype ReadP a = R (forall b. (a -> P b) -> P b)

The type, involving forall b inside a data constructor, is an example of something called a higher-rank type. We’re not going to talk about how this internal representation works here. Just note that ReadP works like Parser, but there are no (public) data constructors.

Small differences in names, functionality, types compared to our Parser. Indeed, the combinators we implemented above emulate those from ReadP.

Parser a ReadP a
Parser readS_to_P (avoid using this!)
runParser readP_to_S
char char
satisfy satisfy
string string
token
(<$>) (<$>)
(<*>) (<*>)
(>>=) (>>=)
(<|>) (Alternative) (<|>), (+++)
many (Alternative) many
some (Alternative) many1
optional (Alternative) optional (works differently)
option option
(<++) (<++)
munch munch
sepBy / sepBy1 sepBy / sepBy1
between between
skipSpaces skipSpaces
look (lookahead efficiently)

ReadPrec

Compared to the minimal complete definition

readsPrec :: Int -> ReadS a

there is a “new-style” parser interface

readPrec :: ReadPrec a

with a ReadPrec type that manages precedence of ReadP parsers more efficiently.

Have fun digging into the details!