Monday, July 14, 2014

Parsing RAML with FParsec, Part 2: The Spec That Launched A Thousand LOC

RAML, the Spec

So, what exactly does RAML look like?  You can find the official spec on GitHub, but that seems a bit remote (Ha!  Sorry, couldn't resist the DVCS pun :)).  Let's take a look at what some actual specs would look like and see if we can build an intuition from that.

Here's an embed of a file that's in my repo that I'm using for testing:


First off, RAML is an extension of YAML, which is a significant-whitespace, human-readable data interchange format.  It also supports remote references, which may will likely be a pain to parse later.  This isn't an exhaustive example of all the things RAML must support, but it's enough to get going.

Baby Steps

My general approach with this project is going to be to take each item in line in this example, taking care to adhere to the spec as best as I can interpret for each type of element.  I'm going to try to be pretty exhaustive, because I'm pretty sure this blog is going to be my own reference for what I was thinking when I'm crawling through git blame cursing at myself.

So, let's start out with the very first tag, the RAML spec number.  It looks like this:


This looks very similar to a YAML version number; namely it's 
  1. a special comment marker (#%), 
  2. a file type (RAML), 
  3. some whitespace, 
  4. and the version number.  
Because we have these discrete elements, why not parse them out in the same way I just described?  The nice thing about FParsec, and other libraries known as "applicative combinators", is that it allows us to do just that.  We can define parsers for each of the elements above, chain them in sequence, and get some greater meaning out of the disparate parts.

AST, Oh Me!

Before we jump into the parsing, we need to have some idea of what the final representation of our API will look like.  This schema is an Abstract Syntax Tree, or AST.  We'll define the shape of RAML in an AST, and then define a parser for a single part of the AST, joining them together one by one until we have a final, full implementation of the spec.

So! Starting from the very beginning of our RAML sample, the spec number, our AST would look like this in F#:


Our spec, initially, will only be able to parse the version.  So let's go to that.

Our First Parser

FParsec comes with a multitude of parsers. Let's take a look at the first one we'll be using, skipString.  Its signature is:
val skipString: string -> Parser<unit, 'u>
Let's break this down a bit. 
  1. val skipString- this is just the name of the function . This parser is a public member of its containing module.
  2. string -> - this parser is a function that takes a string and returns something else
  3. Parser<unit, 'u> - this parser yields unit (or void), and some generic User State.  This User State is something that we don't care about right now, so pretend it's of type unit as well, or void.
So how do we use this to do something useful?  Well, if you combine the skipString parser with a string to expect, what it'll do is walk through the input stream and consume that string, and then throw it away, advancing us along.  This is perfect for situations like we have here, where there's some useless string standing between us and the version number we want to parse.  Let's try using the skipString parser generator function to generate a parser for the sentinel string for the version number.


Ok, we're set.  Now, the whole point of FParsec is to allow us to chain building block like these together to make more complex constructs.  We chain these together with a bunch of different operators.  The ones I use most commonly, so far at least, are:
  • .>>
    • This operator is used like this: parserA .>> parserB, and means "evaluate parserA, then evaluate parserB, and return the result of parserA"
  • >>.
    • This operator is used the same way, but returns the result of parserB instead.
So, to make a full parser for the version string line in a spec, we'd want to chain together a parser for the sentinel (which we've done), a parser for the space, a parser for the version number, and a parser that advances us to the next line.  Using the operators we just saw, that looks like this:

 If you read this in order, it reads:
  1. Evaluate "skipString "#%RAML"" and "ws", returning the former,
  2. Evaluate that previous result and "pfloat", returning the latter,
  3. Evaluate that previous result and "advanceToEOL" (which is really just an alias for blowing through the rest of the line) and return the former.
Which, in essence, is what we were trying to do with our stated goal of "Parsing a RAML version string".  Now that we have this value, we have to get it into an instance of RamlDef, which FParsec does using the operator |>>, which means "apply the result of the Parser on the left (remember the result of a Parser<'a,'b> is 'a) as the first parameter to the function on the right. So if we have a function 'makeRamlDef' that accepts a float, we could do 

Which, as written, is read "the raml parser is the result of the version parser applied to the function makeRamlDef"

And now, if you go to an f# interactive, or a test project or something, and call the FParsec run method


You'll end up with "success"!

Conclusion

So now we've at least got a single parser.  Next time we'll look at combining separate parsers for RAML spec data in sequence as we fill out more of the spec.