The XML Instance Gamut

Wilfred Springer

If you happen to be in the business of writing software serving XML documents or consuming XML documents - and if you read this post, then there is a fair chance you are - then there is always one big challenge: how do you make sure your service or client is capable of dealing with all of the XML documents you could possibly expect to be passed around?

And if you happen to come from the test-driven world, the answer is obviously: by testing it. However, if you try to do that, things might be harder than you expect at first.

What about schemas?

I clearly remember having to integrate with Google's Local Search Service. We managed to get them to send us their schema, but the schema was merely illustrative rather than normative. In fact, it didn't even 'parse' correctly: it was supposed to be a DTD, but in reality it wasn't. In that case, you are basically lost. The only thing you can really do is 'test by poking around': see what the web service replies, and then work those replies into your test harness.

If you do manage to get a schema, you are still not done. Sure, for SOAP-based web services you might be able to generate stubs and skeletons, and those would give you some guarantee that you are covering most cases. But there is still a chance that you will not cover all cases: inside your XML document there might be alternative content models, and when implementing your service you might only be dealing with one of them.

If the schema is small, you can probably figure it out by careful examination. However, if the schema is huge, the range and variety of XML document instances that you might get will make that impossible. And even if you created the schema yourself, it might sometimes cover a wider range of options than you expected. (I'm sure I am not the only one who has experienced this. ;-))

XML Instance Generator to the rescue

So, back to test-driven. The good news is that there are tools that take a schema and generate random instances, basically walking all of the different options. Xmlgen is one of those tools. It's a little hard to find these days. If you follow the 'XML Instance Generator' link on Kohsuke's homepage, you will end up in no man's land. I dug a little further and found out it's currently hosted at Sun's dev.java.net.

Xmlgen is extremely simple. It takes a schema (in any schema language) and generates any number of sample documents from it. It's exactly what you want, except… it doesn't support all datatypes defined by the XML Schema Datatypes specification. And that's something I have run into more than once.

In fact, I had tried to use xmlgen on a couple of earlier occasions, and each time it broke on missing support for xs:dateTime or xs:pattern restrictions. And there doesn't seem to be an awful lot of work going into xmlgen to fix that.

Fixing XML Instance Generator

So I figured I'd fix this myself. It turned out that adding support for dateTime wasn't all that hard, even though xmlgen does not really have extension points to implement. You're basically left with a) hacking the source code big time, or b) hacking it just a little in order to add plug points, and then having something else implement those plug points, which is what I did.
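To make the plug point idea concrete, here is a minimal Java sketch. It is not xmlgen's actual code: `ValueGenerator`, `GeneratorRegistry` and `DateTimeGenerator` are hypothetical names, and the sketch assumes that a generator is looked up by datatype name and that any UTC timestamp between 1970 and roughly 2033 is an acceptable xs:dateTime sample.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical plug point: one generator per datatype name (not xmlgen's API). */
interface ValueGenerator {
    String generate(Random random);
}

/** Hypothetical registry the instance generator would consult for simple types. */
final class GeneratorRegistry {
    private final Map<String, ValueGenerator> generators = new HashMap<>();

    void register(String typeName, ValueGenerator generator) {
        generators.put(typeName, generator);
    }

    String generate(String typeName, Random random) {
        ValueGenerator generator = generators.get(typeName);
        if (generator == null) {
            throw new IllegalArgumentException("unable to handle datatype: " + typeName);
        }
        return generator.generate(random);
    }
}

/** Produces xs:dateTime lexical values such as 2009-10-19T14:07:00Z. */
final class DateTimeGenerator implements ValueGenerator {
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'");
    private static final long MAX_EPOCH_SECOND = 2_000_000_000L; // assumed upper bound, in 2033

    @Override
    public String generate(Random random) {
        long epochSecond = (long) (random.nextDouble() * MAX_EPOCH_SECOND);
        return LocalDateTime.ofEpochSecond(epochSecond, 0, ZoneOffset.UTC).format(FORMAT);
    }
}
```

With this shape, supporting another datatype means registering one more `ValueGenerator` rather than touching the generator's core, which is the trade-off option b) aims for.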

Whoops, xs:pattern

Adding support for xs:pattern turned out to be a little tricky. If you are new to this type of restriction, then you should know that it is about restricting content to fit a certain regular expression, as illustrated below.

<simpleType name='better-us-zipcode'>
  <restriction base='string'>
    <pattern value='[0-9]{5}(-[0-9]{4})?'/>
  </restriction>
</simpleType>

Now, if you want to generate valid data for this restriction, you need to be able to generate text from that regular expression. It turns out there are quite a few Java libraries out there capable of matching text against a regular expression, but there is nothing at all for generating text from one. So I implemented my own. I blogged about it here, and it is hosted here.
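To give a feel for what generating text from a regular expression involves, here is a minimal, self-contained Java sketch. It is not the library mentioned above: `RegexSampler` is a hypothetical name, and it assumes a small regex subset only (literals, escaped punctuation, character classes with ranges, groups, and the quantifiers `?`, `*`, `+`, `{n}`, `{n,}` and `{n,m}`), with open-ended repeats capped at an arbitrary limit. Alternation, `.` and negated classes are rejected rather than handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sampler for a small regex subset; not the library mentioned in the post. */
final class RegexSampler {
    private interface Node { void emit(StringBuilder out, Random rnd); }

    private static final int MAX_EXTRA = 5; // arbitrary cap for '*', '+' and '{n,}'
    private final String regex;
    private final Node root;
    private int pos;

    RegexSampler(String regex) {
        this.regex = regex;
        this.root = parseSequence();
        if (pos != regex.length()) throw new IllegalArgumentException("Unexpected ')' at " + pos);
    }

    /** Produces one random string that the regular expression accepts. */
    String generate(Random rnd) {
        StringBuilder out = new StringBuilder();
        root.emit(out, rnd);
        return out.toString();
    }

    // sequence := quantified* (stops at ')' or at the end of the input)
    private Node parseSequence() {
        List<Node> parts = new ArrayList<>();
        while (pos < regex.length() && regex.charAt(pos) != ')') parts.add(parseQuantified());
        return (out, rnd) -> parts.forEach(part -> part.emit(out, rnd));
    }

    // quantified := atom ('?' | '*' | '+' | '{n}' | '{n,}' | '{n,m}')?
    private Node parseQuantified() {
        Node atom = parseAtom();
        int min = 1, max = 1;
        char next = pos < regex.length() ? regex.charAt(pos) : '\0';
        if (next == '?' || next == '*' || next == '+') {
            min = next == '+' ? 1 : 0;
            max = next == '?' ? 1 : min + MAX_EXTRA;
            pos++;
        } else if (next == '{') {
            int close = regex.indexOf('}', pos);
            String[] bounds = regex.substring(pos + 1, close).split(",", -1);
            min = Integer.parseInt(bounds[0]);
            max = bounds.length == 1 ? min
                : bounds[1].isEmpty() ? min + MAX_EXTRA : Integer.parseInt(bounds[1]);
            pos = close + 1;
        }
        final int lo = min, span = max - min + 1;
        return (out, rnd) -> {
            for (int i = lo + rnd.nextInt(span); i > 0; i--) atom.emit(out, rnd);
        };
    }

    // atom := '(' sequence ')' | '[' class ']' | literal
    private Node parseAtom() {
        char c = regex.charAt(pos);
        if (c == '(') {
            pos++;
            Node inner = parseSequence();
            if (pos >= regex.length()) throw new IllegalArgumentException("Missing ')'");
            pos++; // skip ')'
            return inner;
        }
        if (c == '[') {
            pos++;
            if (regex.charAt(pos) == '^') throw new UnsupportedOperationException("Negated class");
            StringBuilder pool = new StringBuilder();
            while (regex.charAt(pos) != ']') {
                char first = literal(), last = first;
                if (regex.charAt(pos) == '-' && regex.charAt(pos + 1) != ']') {
                    pos++; // skip '-'
                    last = literal();
                }
                for (int ch = first; ch <= last; ch++) pool.append((char) ch);
            }
            pos++; // skip ']'
            return (out, rnd) -> out.append(pool.charAt(rnd.nextInt(pool.length())));
        }
        if ("|.^$*+?{".indexOf(c) >= 0) throw new UnsupportedOperationException("Unsupported: " + c);
        char value = literal();
        return (out, rnd) -> out.append(value);
    }

    // Reads one literal character; a backslash may only escape punctuation here.
    private char literal() {
        char c = regex.charAt(pos++);
        if (c != '\\') return c;
        char escaped = regex.charAt(pos++);
        if (Character.isLetterOrDigit(escaped)) throw new UnsupportedOperationException("\\" + escaped);
        return escaped;
    }
}
```

Calling `new RegexSampler("[0-9]{5}(-[0-9]{4})?").generate(new Random())` repeatedly should yield five-digit codes, some of them with a four-digit suffix. Parsing the expression once into a tree and walking it per sample keeps generation cheap when many instances are needed.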

Once that was done, extending xmlgen to have support for xs:pattern restrictions was easy. That means that - with just a few changes - I am now able to generate a test set for a fairly complicated schema. And I'm pretty sure that it will cover all cases, as long as I make the number of instance documents big enough.
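One way to double-check a generated test set is to validate every instance against the schema it came from, using the JDK's `javax.xml.validation` API. The sketch below is an illustration under assumptions: it wraps the zipcode type from above in a tiny schema, and the `zip` element, the `xs:` prefix and the class name `ZipcodeInstanceCheck` are additions for this example only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ZipcodeInstanceCheck {
    // The simpleType from the post, wrapped in a schema with a hypothetical 'zip' element.
    private static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:simpleType name='better-us-zipcode'>"
        + "<xs:restriction base='xs:string'>"
        + "<xs:pattern value='[0-9]{5}(-[0-9]{4})?'/>"
        + "</xs:restriction>"
        + "</xs:simpleType>"
        + "<xs:element name='zip' type='better-us-zipcode'/>"
        + "</xs:schema>";

    private static final Schema SCHEMA = compile();

    // Compiles the schema once; a broken schema is a programming error, not a test result.
    private static Schema compile() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
    }

    /** Returns true if the document is well-formed and valid against the schema. */
    static boolean isValid(String xml) {
        try {
            SCHEMA.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation (or parse) errors are reported as SAXException
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running each generated document through a check like `isValid` catches bugs in the generator itself before its output is used to test a service or client.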

So, now for a restriction like this:

<xsd:simpleType name = "TimeValue">
  <xsd:restriction base = "xsd:string">
    <xsd:pattern value = "[0-2][0-9]\:[0-5][0-9](\:[0-5][0-9])?"/>
  </xsd:restriction>
</xsd:simpleType>

… it will generate instances like this:

  • 07:36
  • 10:16:26
  • etc.
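A quick way to sanity-check such samples is to match them against the same expression with `java.util.regex`. This is only a sketch: `TimeValueCheck` is a hypothetical name, and it assumes that Java's regex dialect agrees with the schema's for this particular pattern.

```java
import java.util.regex.Pattern;

final class TimeValueCheck {
    // The pattern from the TimeValue restriction; "\\:" is simply an escaped colon.
    private static final Pattern TIME_VALUE =
        Pattern.compile("[0-2][0-9]\\:[0-5][0-9](\\:[0-5][0-9])?");

    /** Whole-string match, which is how a pattern restriction is applied. */
    static boolean matches(String candidate) {
        return TIME_VALUE.matcher(candidate).matches();
    }
}
```

`matches("07:36")` and `matches("10:16:26")` return true, but so does `matches("29:59")`: the pattern accepts hours up to 29, so a generator walking all options will eventually produce values like that. It is a small instance of a schema covering a wider range than its author probably intended.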

You can download the modified version of xmlgen here.

Comments (5)

  1. Bruno Vernay

    October 19, 2009 at 2:07 pm

    Does Databene Benerator (http://databene.org/databene-benerator/all-internal-resources/39-documentation/88-xml-schema-support.html) fit your needs?
    There are also other tools (I didn't check them all, and some are dead): http://databene.org/databene-benerator/similar-products.html

    Did you contact Kohsuke? Or do you plan to share your development more explicitly? 🙂

    Regards
    Bruno

  2. Wilfred springer

    October 19, 2009 at 3:03 pm

    I absolutely will. I just wanted to get this out as quickly as possible, since we were depending on it.

  3. Erik Jan de Wit

    October 20, 2009 at 9:48 am

    When searching for a regular expression / data generator framework, you probably never heard of mine, testdata-generator: http://kenai.com/projects/testdata-framework/ It can be used to create proxy objects that produce randomly filled test objects. But it also contains a regular expression generator like the one you describe, creating strings that match a regular expression.

  4. David Carver

    October 21, 2009 at 8:47 pm

    Another area it still has problems with, even with your update, is the "language" data type. It always says "unable to handle datatype" when it encounters one of these.

    • Wilfred Springer

      October 22, 2009 at 7:51 am

      Good point. I will see when I have time to move my code to Github. When that's done, adding support for language would be a breeze. You basically would have to implement this interface:

      package com.sun.msv.generator;

      import org.relaxng.datatype.Datatype;

      public interface SimpleTypeGenerator {

          <T extends Datatype> String generate(T type, ContextProviderImpl context);

      }
