Pimping the Scala XML library

Age Mooij

Earlier this week I ran into a missing feature in the Scala xml library and I ended up adding this feature myself, which turned out to be pretty simple.

I was trying to extract the text contents of an element in a piece of XML using the handy \ and \\ methods on scala.xml.NodeSeq. These methods allow you to extract sub-elements from an XML node in a way very similar to XPath, something like this:

val xml = <a><b><c>text</c></b></a>
val c1 = xml \ "b" \ "c"
val c2 = xml \\ "c"
val text = c2.text

The problem I ran into occurred when I tried to use these methods to extract an element when one of its attributes had a certain value.

Based on my experience with "real" XPath, I tried to do the following:

val xml = <a><b id="b1"/><b id="b2"/></a>
val b2 = xml \ "b[@id == b2]"

But that did not work, resulting in an empty NodeSeq. I tried a couple of syntax variants but nothing seemed to work. When I went to have a look at the Scala source code, it was pretty easy to see that support for this type of thing simply did not exist. There is a package called scala.xml.path but it only contains one source file and, as far as I could tell, this package and its contents are neither useful nor used anywhere else in the Scala source code.
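For comparison, here is a minimal sketch of how the same selection can be done with the stock library by filtering the result of \ on the attribute value. It assumes scala-xml is on the classpath; the object and value names are only for illustration.

```scala
import scala.xml._

object StockFilterExample {
  val xml: Elem = <a><b id="b1"/><b id="b2"/></a>

  // The built-in \ compares the whole string against element labels,
  // so the predicate is never interpreted and nothing matches.
  val viaPredicate: NodeSeq = xml \ "b[@id == b2]"

  // Selecting the b children first and filtering on the attribute text
  // gives the intended result.
  val selected: NodeSeq = (xml \ "b").filter(b => (b \ "@id").text == "b2")
}
```

This works, but the filter has to be written out at every call site, which is the inconvenience that an XPath-like predicate would remove.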

Having seen the source code for the \ and \\ methods, I thought it should be pretty simple to add support for extracting nodes based on attribute values, and indeed it was. The Scala 2.7.5 source code for those methods was kind of clunky and very imperative (lots of while loops and nested ifs), but the 2.8 code looked a lot cleaner, so I chose to base my implementation on the approach taken there.

So without further ado, here is the code for my RichNodeSeq implementation. The full code and the accompanying unit tests can be found on GitHub.

import scala.xml._

object RichNodeSeq {
  val MatchNodeByAttributeValueRegExp = """^(.*)\[@(.*)==(.*)\]$""".r

  def apply(nodeSeq: NodeSeq): RichNodeSeq = {
    new RichNodeSeq(nodeSeq)
  }
}

class RichNodeSeq(nodeSeq: NodeSeq) extends NodeSeq {
  def theSeq = nodeSeq.theSeq

  import RichNodeSeq._

  override def \(that: String): RichNodeSeq = {
    def filterChildNodes(cond: (Node) => Boolean) = 
      RichNodeSeq(NodeSeq fromSeq (this flatMap (_.child) filter cond))

    that match {
      case MatchNodeByAttributeValueRegExp(element, attribute, value)
             => filterChildNodes(
                isElementWithAttributeValue(_, element.trim, attribute.trim, value.trim))

      case _ => RichNodeSeq(super.\(that))
    }
  }

  override def \\(that: String): RichNodeSeq = {
    def filterChildNodes(cond: (Node) => Boolean) = 
      RichNodeSeq(NodeSeq fromSeq (this flatMap (_.descendant_or_self) filter cond))

    that match {
      case MatchNodeByAttributeValueRegExp(element, attribute, value)
             => filterChildNodes(
                isElementWithAttributeValue(_, element.trim, attribute.trim, value.trim))

      case _ => RichNodeSeq(super.\\(that))
    }
  }

  private def isElementWithAttributeValue(
                  node: Node,
                  elementName: String,
                  attributeName: String,
                  attributeValue: String): Boolean = {
    (node.label == elementName) .&&(
      node.attribute(attributeName) match {
        case Some(attributes) => attributes(0) == attributeValue
        case None => false
      }
    )
  }
}

You might notice some funky use of braces here and there, especially in the last bit. This should not really be necessary, but without the braces and the periods the Scala Eclipse plugin could not parse the code correctly.

You can use this functionality by wrapping any existing instance of NodeSeq in an instance of RichNodeSeq. I played around with an implicit conversion so all you would need was an import statement but (AFAIK) I would have had to rename the methods to make that work. Here is an example:

val xml = <a><b id="b1"/><b id="b2"/></a>
val b2 = RichNodeSeq(xml) \ "b[@id == b2]"
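To illustrate the renaming trade-off, here is a rough sketch of what an import-only variant might look like. It uses an implicit class, which needs Scala 2.10 or later, and `NodeSeqSyntax`, `AttributeFilter` and `childrenWith` are hypothetical names that are not part of the RichNodeSeq code. A new method name is needed because NodeSeq already defines \ itself, so the compiler would never look for an implicit conversion to supply it.

```scala
import scala.xml._

object NodeSeqSyntax {
  // Hypothetical enrichment; only applied because NodeSeq has no
  // member called childrenWith.
  implicit class AttributeFilter(val nodes: NodeSeq) extends AnyVal {
    def childrenWith(label: String, attribute: String, value: String): NodeSeq =
      (nodes \ label).filter(n => (n \ ("@" + attribute)).text == value)
  }
}

object NodeSeqSyntaxExample {
  import NodeSeqSyntax._

  val xml: Elem = <a><b id="b1"/><b id="b2"/></a>
  val b2: NodeSeq = xml.childrenWith("b", "id", "b2")
}
```

The price is that the XPath-like string syntax is lost, which is why wrapping in RichNodeSeq keeps the nicer call site.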

So how does it all work? The basic approach is to extend NodeSeq and to override the two XPath-like methods. I use a regular expression extractor to test whether the argument to the methods contains an attribute value expression. If not, I simply hand over to the superclass.
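The extractor step can be tried in isolation. The sketch below assumes a pattern of the form name[@attribute==value]; `PredicateParsing`, `AttributePredicate` and `parse` are illustrative names only.

```scala
object PredicateParsing {
  // Three capture groups: element name, attribute name, attribute value.
  val AttributePredicate = """^(.*)\[@(.*)==(.*)\]$""".r

  // A Regex used as an extractor only matches when the whole string
  // fits the pattern, so plain element names fall through to None.
  def parse(step: String): Option[(String, String, String)] = step match {
    case AttributePredicate(element, attribute, value) =>
      Some((element.trim, attribute.trim, value.trim))
    case _ => None
  }
}
```

With this, `PredicateParsing.parse("b[@id == b2]")` yields `Some(("b", "id", "b2"))`, while `PredicateParsing.parse("b")` yields `None`. The trimming is what allows spaces around the == in the argument.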

If the argument does contain an attribute value expression, I use the extracted element name, attribute name, and attribute value to match against the children (for \) or the descendants (for \\) of the current NodeSeq. The isElementWithAttributeValue method then simply determines whether the element name matches, whether the element has the named attribute, and if so whether that attribute matches the attribute value. The nested pattern match in the boolean expression makes it look a bit harder than it should be, so I think I will extract that into another method later.
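A possible shape for that extraction, as a sketch only: `AttributeMatching` and `hasAttributeValue` are hypothetical names, and comparing the attribute's text instead of the node itself is an assumption made here so that the check does not depend on how a given scala-xml version defines equality between a Node and a String.

```scala
import scala.xml._

object AttributeMatching {
  // True when the node has the attribute and its text equals the value.
  def hasAttributeValue(node: Node, name: String, value: String): Boolean =
    node.attribute(name).exists(_.map(_.text).mkString == value)

  // The element check then reads as a plain conjunction.
  def isElementWithAttributeValue(
      node: Node,
      elementName: String,
      attributeName: String,
      attributeValue: String): Boolean =
    node.label == elementName &&
      hasAttributeValue(node, attributeName, attributeValue)
}
```

`Option.exists` covers both the missing-attribute case and the comparison, so the explicit Some/None match is no longer needed.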

As stated above, the sources and the unit tests are available on GitHub so go ahead and use this code if you think it is useful. And, of course, if you think I did it completely wrong (or totally right), don't hesitate to leave a comment.

Comments (3)

  1. Andrew Phillips

    July 27, 2009 at 7:53 am

    The implementations of \ and \\ seem to contain a lot of duplicate code - as far as I can see, the only differences seem to be _.child vs. _.descendant_or_self in filterChildNodes and super.\ vs. super.\\ in the final match. Could these be elegantly factored out..?

  2. Age Mooy

    July 27, 2009 at 9:53 am

    Hey ! This code has only seen one refactoring run until now ! Be nice :)

    I'm sure I can extract some more duplication but I'm still experimenting with how readable Scala code gets after extracting stuff and I did not want to end up with example code that nobody could read.

    Of course you could always fork the code on GitHub and fix it yourself. Surprise me :)
