Parsing Text with Scala

Jan Vermeir

In my efforts to teach myself Scala, I tried solving a problem I've tackled in various languages, 6510 assembly code (didn't get far...), pl/sql, Java (with and without Drools) and Groovy among them. Usually I get bogged down in some detail of the language so I never get to reap any actual benefits of my efforts in daily life. The plus side of this never ending task is that by now I don't have to spend effort on defining a problem but instead can start coding right away.
So this story is about how to parse text in Scala and is part of THE software package that will automagically generate a menu for a week and the shopping list for that menu together with whatever else my family will need that week and send it to www.albert.nl and have the groceries delivered to my door.

Thinking big and starting small, I set about (again) parsing a set of recipes so I can extract the list of ingredients necessary for the shopping list. Since all of my practical knowledge of Scala came from reading the book by Martin Odersky, there was some evolution in my code. As we go, I'll try and amuse you with some of the clunky contraptions I've created along the way while trying to shake of old habits.

The aim is to parse a list of recipes like the one below:

name ZucchiniWithMinceMeat
season summer
category easy
cookingtime 45 minutes
preheat oven at 180 degrees
fry 500 gram mincemeat
fry 1 piece Zucchini
add 1 can mashedTomatoes
add 1 bag groundCheese
bake for 30 minutes

and turning it into a list of groceries like this:

mincemeat 500 gram
Zucchini 1 piece
mashedTomatoes 1 can
groundCheese 1 bag

Scala offers a parser for text named StandardTokenParsers in the package scala.util.parsing.combinator.syntactical. The idea is that a specific parser extends StandardTokenParsers. The parser should provide

  • A list of delimiters
  • A list of reserved words
  • A 'main' method that accepts text input and starts the parser
  • A set of production rules that parse part of the input, building an object graph along the way.

    lexical.delimiters is the list of characters used to separate tokens in the input.
    lexical.reserved defines a set of key words, i.e. the while, if, for etc of your language. The snippet below shows my definitions (it is part of com.xebia.cooking.CookBookDSL.scala, which you can find in the zip file attached to this post.).

      lexical.delimiters ++= List("(", ")", ",", " ", "\n")
      lexical.reserved ++= List("name", "season", "fry", "pack", "can", "bag", "piece", "at", "for", "bottle", "gram", "oven", "preheat", "bake", "serve", "with","add","degrees", "category", "cookingtime")
    

    lexical is of type Lexical and is an inherited attribute of StandardTokenParsers.

    The parser is started like this:

      def parseCookBook(cookBook:String):CookBook = {
    		cookbook(new lexical.Scanner(cookBook)) match {
    		  case Success(recipeList, _) => new CookBook(recipeList)
     		  case Failure(msg, _) => throw new ParseErrorException(msg)
    		  case Error(msg, _) => throw new ParseErrorException(msg)
    		}
      }
    

    The input is a string representation of a cookbook. parseCookBook returns an object of type CookBook which we will discuss later. The parsing is kicked off by calling the cookbook method with an instance of a Scanner as a parameter. The Scanner will emit tokens that are matched with the rules of the grammar as we'll see below.
    Using Scala's case construct the result of the parser is translated in either a CookBook instance or results in an exception.

    The parser consists of a set of rules in the form of method definitions that look like the one below:

      def recipeName: Parser[String] =
        "name" ~ ident ^^ { case "name" ~ name => name }
    

    The rule defines the name of a recipe as a String. The text to the left of the ^^ operator defines the tokens that must appear in the input for the rule to match. In this case we expect the string "name" followed by an identifier. ident is defined in StdTokenParsers (along with numericLit which I'll use later) and matches a string of alphanumeric characters. The tokens that are thus parsed are used to select a action to the right of the ^^ sign. "name" ~ name introduces a variable named name that is initialized with the value of the ident token. name is then returned as a result of the rule by placing it to the right of the => sign.

    A slightly more complex example is shown below:

      def ingredient: Parser[Ingredient] = 
    	opt(amount) ~ ident ^^
      	{ case Some(amount) ~ ingredient => (new Ingredient(ingredient,amount))
      	  case  _ ~ ingredient => {new Ingredient(ingredient)}
      	}
    

    This rule matches text like "500 gram mincemeat", "1 piece Zucchini" but also "salad" or "tomatoes". The amount part of the definition is made optional. Instead of just one case clause like we saw in the first rule, this rule has two cases. The first matches an amount followed by an ingredient, the second matches just an ingredient without the amount part.

    One rule can build on the results of other rules as illustrated below:

      def step: Parser[Step] = 
    	(fryStep | preHeatStep | serveWithStep | addStep | bakeStep) ^^ { case step => step }
    

    A step in a recipe is something like "preheat oven at 180 degrees" or "fry 500 gram mincemeat". The clauses to the left are alternatives that are defined as separate rules. The result is always a Step instance. Steps are important because some of them define ingredients which, in some hypothetical future version of my magnum opus, will result in a shopping list (do I hear laughter?).

    One last sample rule shows repeating elements:

      def listOfSeasons: Parser[List[Seasons]] =
        repsep(season ,",") ^^ { listOfSeasons: List[Seasons] => listOfSeasons }
    }
    

    This rule matches one or more occurrences of a season separated by a comma using the repsep clause. The result is a List of objects of type Seasons.

    The parser code uses classes that represent concepts in my domain, like Ingredient, Step and Recipe. I've defined these classes in a separate source file named Recipe.scala. Most of the classes are empty and just serve to define concepts. They may become more interesting later (you know, when I implement this wonderful..., etc, etc...). Others, like StepWithAnIngredient serve to collect ingredients that will be used to create a shopping list.
    Anyway, one thing I learned from writing the classes in the Recipe.scala file is the use the mkString method. Given my Java background I wrote toString methods like this:

      override def toString = {
      	val result:StringBuilder = new StringBuilder("Steps: ") 
      	
       	list.foreach(step => result.append(step).append("\n"))
      	result.toString()
      }	
    

    I was feeling pretty happy about this (you know, remembering to use the foreach method) until I realized this kind of thing is way easier implemented like this:

    	override def toString = "Steps: " + steps.mkString("\n")
    

    You can call mkString on any instance you like and it will return a string representation, which is especially convenient in the case of iterable objects.

    Another powerful Scala feature is the set of methods defined on lists. One such method is filter that accepts a closure executed on each element of the list. If the condition returns true the element is added to the result.
    In the example below, the _ represents a recipe, i.e. an element of the list of recipes.

      def findRecipesBySeason(season:Seasons):Seq[Recipe] = {    
        recipes.filter(_.seasons.contains(season)) 
      }
    

    Another thing that had me thoroughly confused at first was the syntax for constructor parameters. The example below shows the variables amount and unitName defined in the Amount class, the Java way:

    public class Amount {
    	private int amount;
    	private String unitName;
    	public Amount(int amount, String unitName) {
    		this.amount=amount;
    		this.unitName=unitName;
    	}
    	// getters and setters omitted for brevity...
    }
    

    And in Scala:

    class Amount (val amount:Int, val unitName:String) { 
    	override def toString = amount+":"+unitName 
    } 
    

    The parameters following the name of the class are turned into public val's. With my Java background and my compulsion to hit ctrl-1 in Eclipse, it took me awhile to realize how easy and concise the Scala syntax is.

    Note that the Recipe.scala file contains a list of class definitions. In Scala there is no one-to-one mapping from public classes to source files. This allowed me to combine a list of related concepts in a single source file.

    Finally, I found that the best way to develop parsers is by taking very small steps at a time. It is pretty easy to mess things up beyond repair by changing too much at at once.

    This Scala parser is by far the most complete version I've built so far. I'm now starting to like the language, even though my lack of experience gets in the way at least 15 times an hour. Who knows, I might actually create something useful this time...

  • Comments (9)

    1. Andrew Phillips - Reply

      October 22, 2009 at 9:34 am

      I shouldn't have read this nice post in the morning - now I'm hungry ;-) A small point: if I'm not mistaken the mkString methods are defined on the Iterable trait, so you can't call them on everything. See the Iterable documentation.

      I'd also be curious to hear how Scala shapes up in comparison to the other languages you've tried this in. You know, advantages, pains-in-the-a**, that kind of thing...

    2. Jesper Nordenberg - Reply

      October 22, 2009 at 12:25 pm

      Whenever you define an immutable data type, like your Amount type, you should consider making it a case class. That way you will get benefits like equals, hashCode, toString (which you of course can override) and pattern matching for free.

      Also, expressions like:

      "name" ~ ident ^^ { case "name" ~ name => name }

      can be more concisely written:

      "name" ~> ident

    3. Jan Vermeir - Reply

      October 23, 2009 at 7:18 am

      Thanks Jesper, I didn't know expressions could be abbreviated like that. Your shorter version does eliminate a lot of redundancy (so in a way I could have thought of that myself, of course given Scala's focus on conciseness).
      Case classes have me a bit a confused still, but I'll try out your suggestion.

      @Andrew: the only version that ever worked for me is this Scala version. I lost the Drools code and I wrote it like ages ago so I don't remember the details. It did work in a limited way however.
      Needless to say the 6510 assembly language version failed miserably. What did work though was writing an addition to Commodore64 Basic. It worked by extending a symbol table in the Basic interpreter. This allowed me to add custom statements to a program, much like an internal DSL.
      Another version that sort of worked was the Groovy one. I used the MethodMIssing thingy to add new methods to
      existing classes. This allowed me to chain together arbitrary words and handle their meaning in a method. This wasn't very elegant but it was fast and easy to develop. It would allow you to easily implement a internal DSL like the one Camel uses to configure routes. I guess you could do something similar in Scala, but I haven't tried.

    4. Patrick - Reply

      October 23, 2009 at 4:06 pm

      This was a nice read--would you care to post the complete sample code somewhere so I can see the whole thing? Would like to have a complete parser combinator example (that's not another basic-arithmetic-expression-language) to look at. Thanks!

    5. Jan Vermeir - Reply

      October 23, 2009 at 7:13 pm

      Hi Patrick, you can download the sources here:

      http://blog.xebia.com/wp-content/uploads/2009/10/dsl.zip

    6. Patrick - Reply

      October 23, 2009 at 7:58 pm

      Great! Thanks for posting the code.

    7. Gaurav - Reply

      May 11, 2010 at 10:01 am

      Hi Jan,

      I'm trying to tackle the same problem(parse text and turn it into a list of groceries). Only I'm using Python instead of Scala.
      I was wondering if you made any progress on this code.

    8. Jan Vermeir - Reply

      May 12, 2010 at 8:17 am

      Hi Gaurav, I didn't make any further progress. This is mainly because I do my shopping at Albert Heijn, the largest retailer in the Netherlands (www.ah.nl). They are quite ambitious and plan to offer an Iphone app that allows you to navigate through the store based on a shopping list you can create on their website. This may actually lead to me buying one of those iphone thingies.
      On the other hand, what is to stop us from doing a similar thing? The shopping list business case has enabled me to learn new stuff in the past. Would you like to work on the Scala code? Should we create an open source project and a backlog and stuff?

    9. interesting articles - Reply

      January 25, 2012 at 7:04 am

      Is it sad that I'm just now understanding the Futurama parody of this?

    Add a Comment