<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>Xebia Blog</title>
	<atom:link href="http://blog.xebia.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.xebia.com</link>
	<description></description>
	<pubDate>Sat, 04 Jul 2009 13:19:59 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Landmark reached: 20000 unique visitors per month</title>
		<link>http://blog.xebia.com/2009/07/04/landmark-reached-20000-unique-visitors-per-month/</link>
		<comments>http://blog.xebia.com/2009/07/04/landmark-reached-20000-unique-visitors-per-month/#comments</comments>
		<pubDate>Sat, 04 Jul 2009 13:19:59 +0000</pubDate>
		<dc:creator>Serge Beaumont</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2405</guid>
		<description><![CDATA[<p>It kind of snuck up on us, but when we recently checked the blog visitor statistics, we found that we had gone over 20000 unique visitors per month in April! So to all of you who've stayed with us through the past years, to all the Xebians and ex-Xebians who have been contributing posts, and to all who commented on the blog: a big thank you. We hope we can keep offering the content that will push us to, let's say, 50000! <img src='http://blog.xebia.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/07/04/landmark-reached-20000-unique-visitors-per-month/feed/</wfw:commentRss>
		</item>
		<item>
		<title>@Composite for Unpacking COFF Data</title>
		<link>http://blog.xebia.com/2009/07/04/composite-for-unpacking-coff-data/</link>
		<comments>http://blog.xebia.com/2009/07/04/composite-for-unpacking-coff-data/#comments</comments>
		<pubDate>Sat, 04 Jul 2009 10:37:38 +0000</pubDate>
		<dc:creator>Wilfred Springer</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<category><![CDATA[annotations]]></category>

		<category><![CDATA[bit syntax]]></category>

		<category><![CDATA[erlang]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2387</guid>
		<description><![CDATA[<p>A while ago, I <a href="http://blog.flotsam.nl/2009/02/bit-syntax-for-java-i.html">compared</a> <a href="http://preon.flotsam.nl/">Preon</a> with Erlang's bit syntax. I looked at one of the examples from "Programming Erlang" in particular: an example that illustrates how to decode MPEG headers using Erlang. However, this is not the only example in that chapter, so I decided to take a stab at one of the other examples as well.</p>
<p><span id="more-2387"></span></p>
<p>The second example from the bit syntax chapter in "Programming Erlang" is about unpacking <a href="http://en.wikipedia.org/wiki/COFF">COFF</a> data. The thing about COFF is that it doesn't have an IDL-like language or anything similar for defining the data structures: all you have is the definition of C++ data structures, such as the one below:</p>
<pre class="brush: c;">
typedef struct _IMAGE_RESOURCE_DIRECTORY {
DWORD Characteristics;
DWORD TimeDateStamp;
WORD MajorVersion;
WORD MinorVersion;
WORD NumberOfNamedEntries;
WORD NumberOfIdEntries;
} IMAGE_RESOURCE_DIRECTORY, *PIMAGE_RESOURCE_DIRECTORY;
</pre>
<p>In his book, Joe Armstrong explains that, using Erlang's bit syntax and macros, you can unpack COFF data described by the C++ struct above with the Erlang code listed below.</p>
<pre class="brush: text;">
unpack_image_resource_directory(Dir) -&gt;
  &lt;&lt;Characteristics : ?DWORD,
    TimeDateStamp : ?DWORD,
    MajorVersion : ?WORD,
    MinorVersion : ?WORD,
    NumberOfNamedEntries : ?WORD,
    NumberOfIdEntries : ?WORD, _/binary&gt;&gt; = Dir,
...
</pre>
<p>The key message here is that Erlang not only allows you to unpack binary data easily, but also allows you to express the unpacking in a way that clearly maps to the only source of definition of the data structure: the C++ API.</p>
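<p>For contrast, here is a minimal sketch of what decoding the same structure by hand looks like in plain Java, without Preon or Erlang. It is only an illustration: the class name <code>ImageResourceDirectorySketch</code> and its helper methods are made up, and the little-endian byte order is an assumption (it matches PE/COFF files on x86).</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical hand-written decoder for IMAGE_RESOURCE_DIRECTORY.
final class ImageResourceDirectorySketch {
    final long characteristics;      // DWORD: unsigned 32 bits, so it needs a long
    final long timeDateStamp;        // DWORD
    final int majorVersion;          // WORD: unsigned 16 bits, so it needs an int
    final int minorVersion;          // WORD
    final int numberOfNamedEntries;  // WORD
    final int numberOfIdEntries;     // WORD

    private ImageResourceDirectorySketch(ByteBuffer in) {
        // Fields are read in exactly the order in which the struct declares them.
        characteristics = readDword(in);
        timeDateStamp = readDword(in);
        majorVersion = readWord(in);
        minorVersion = readWord(in);
        numberOfNamedEntries = readWord(in);
        numberOfIdEntries = readWord(in);
    }

    static ImageResourceDirectorySketch decode(byte[] data) {
        // Assumption: the data is little-endian.
        ByteBuffer in = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        return new ImageResourceDirectorySketch(in);
    }

    // Java has no unsigned int, so mask the signed value into a long.
    private static long readDword(ByteBuffer in) {
        return in.getInt() & 0xFFFFFFFFL;
    }

    // Same trick for 16-bit values: mask the signed short into an int.
    private static int readWord(ByteBuffer in) {
        return in.getShort() & 0xFFFF;
    }
}
```

<p>The field order and widths live in imperative code here, which is exactly the duplication a declarative mapping tries to avoid.</p>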
<p>Now, if you would use Preon only, the C++ data structure above would translate to this:</p>
<pre class="brush: java;">
class ImageResourceDirectory {
  @BoundNumber(size=&quot;32&quot;) long characteristics;
  @BoundNumber(size=&quot;32&quot;) long timeDateStamp;
  @BoundNumber(size=&quot;16&quot;) int majorVersion;
  @BoundNumber(size=&quot;16&quot;) int minorVersion;
  @BoundNumber(size=&quot;16&quot;) int numberOfNamedEntries;
  @BoundNumber(size=&quot;16&quot;) int numberOfIdEntries;
}
</pre>
<p>... and you would be able to decode it by this:</p>
<pre class="brush: java;">
Codec&lt;ImageResourceDirectory&gt; codec = Codecs.create(ImageResourceDirectory.class);
Codecs.decode(codec, ...);
</pre>
<p>Now, that's not bad, but it doesn't have the similarity to the original C++ API code that the Erlang example has. However, using Andrew Phillips' <a href="http://blog.xebia.com/2009/06/23/composite-macro-annotations-for-java/">@Composite framework</a>, you <i>would</i> actually be able to write this:</p>
<pre class="brush: java;">
class ImageResourceDirectory {
  @DWORD long characteristics;
  @DWORD long timeDateStamp;
  @WORD int majorVersion;
  @WORD int minorVersion;
  @WORD int numberOfNamedEntries;
  @WORD int numberOfIdEntries;
}
</pre>
<p>... which is already a lot closer to the original C++ struct than what we had before. </p>
<p>Support for @Composite has not been included in Preon yet. At first sight, there appear to be two ways to deal with it. First of all, it could be built into the framework, by using the AnnotatedElements interface all over the place. That should work, and it might actually be the most sensible thing to do.</p>
<p>However, there may be an alternative way to get it woven in. Preon already defines a <a href="http://preon.flotsam.nl/preon-binding/apidocs/nl/flotsam/preon/CodecDecorator.html">CodecDecorator</a> interface that allows you to wrap Codec implementations around the Codec instances created. Looking at that, I started to think that it might actually be quite attractive to also define a CodecFactoryDecorator, wrapping CodecFactories around other CodecFactories in the chain of responsibility.</p>
<pre class="brush: java;">
public interface CodecFactory {
    &lt;T&gt; Codec&lt;T&gt; create(AnnotatedElement metadata, Class&lt;T&gt; type,
            ResolverContext context);
}
</pre>
<p>The wrappers created by the CodecFactoryDecorator would be able to intercept any reference to annotations passed down the chain, and replace it with an AnnotatedElement that uses <a href="http://code.google.com/p/aphillips/source/browse/at-composite/trunk/src/main/java/com/qrmedia/pattern/compositeannotation/api/AnnotatedElements.java">AnnotatedElements</a> to gain access to the annotations. As a consequence, CodecFactories responsible for creating the actual Codec from metadata passed in would get to see Preon annotations only, instead of the @DWORD and @WORD annotations.</p>
<p>It's just a thought. None of this has been implemented yet. Feel free to comment. Expect to see some more of this in the future.</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/07/04/composite-for-unpacking-coff-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Starting out with Scala</title>
		<link>http://blog.xebia.com/2009/07/03/starting-out-with-scala/</link>
		<comments>http://blog.xebia.com/2009/07/03/starting-out-with-scala/#comments</comments>
		<pubDate>Fri, 03 Jul 2009 11:07:23 +0000</pubDate>
		<dc:creator>Arjan Blokzijl</dc:creator>
		
		<category><![CDATA[Scala]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2342</guid>
		<description><![CDATA[<p>Scala has become more and more popular over recent months and years. Its hybrid nature, being an imperative as well as a functional language, attracts a crowd from the Java world as well as functional fundamentalists coming from a world where statements like x=x+1 are looked at with utter disbelief. It has been stated that Scala is 'Java as it should have been', but there are also numerous complaints about the language and its features (like not being side effect free, being overly complex, having too much of everything, too much abstraction, a weird syntax, etc). The latter might actually be proof of its popularity, since people seem to be actually using the language instead of just looking at it briefly and stopping, tired but happy, after having written hello world with it.<br />
In this blog post, I'll give you some (hopefully) useful tips on how best to start if you want to learn this language, which is one of the candidates to become 'our next big language' and surpass Java in this respect.<br />
<span id="more-2342"></span></p>
<p><strong>Pick an IDE, or not</strong><br />
The first problem you might encounter is that the various Scala IDE plugins have not yet reached the maturity of the Java support that you might be accustomed to. Lots of hard work is in progress on this, but you still might experience hiccups and freezes (as I have) of your favorite IDE. At this moment, the NetBeans plugin seems to be the most stable (this is the IDE that David Pollak, creator of Lift, seems to use). If you're an Eclipse user, you'll probably want the nightly build version of the Scala plugin, since that's the only one actively under development. This means that you need the Scala trunk to work with, however.</p>
<p>There's also plenty of support outside the IDEs, and recently I've tried switching back to Emacs again. For some, this brings back memories of good old university days; others may find it the utmost horror. I must say, I found it to be a rather soothing experience after years of development using Eclipse. You'll need to do some setup work to get a good Scala experience. An excellent blog post describing how to set up your Emacs environment for Scala using Yasnippet, exuberant ctags and Maven can be found <a href="http://thegreylensmansview.blogspot.com/2009/02/stone-tools-and-scala-development-part.html">here</a>.</p>
<p><strong>Try out the trunk</strong><br />
The latest released version of Scala is 2.7.5. The next version will be 2.8.0, but its release date is still not known. However, the invaluable Paul Phillips has made many bug fixes and improvements to the Scala compiler and the REPL, all hanging out in the trunk. To take full advantage of this, check out the sources and build them. Note that the Scala collections API has been redesigned in the trunk, and differs from the 2.7.5 release. To get a good understanding of the new design, and the API in general, you're encouraged to read the new collection SIP (Scala improvement proposal) <a href="http://www.scala-lang.org/sites/default/files/sids/odersky/Sat,%202009-05-30,%2012:51/collections.pdf">available here</a>.</p>
<p><strong>Read a book</strong><br />
The standard work is Programming in Scala, co-authored by Martin Odersky (creator of Scala) himself. This book is lengthy but very thorough, providing plenty of examples. Another book that has seen the light is Beginning Scala by David Pollak (already mentioned). I have not read this, but reviews suggest this is also an excellent start. More books will hit the market this year. </p>
<p><strong>Check out Lift</strong><br />
Eventually you might start to really like Scala and want to use it in the enterprise. <a href="http://liftweb.net/">Lift</a> is the first web framework written in Scala, created by <a href="http://blog.lostlake.org">David Pollak</a>. Downloading it and getting a simple web application to work is a no-brainer. The distribution comes with plenty of Maven archetypes to get a basic web application ready in under five minutes. Lift has its own database mapping framework, but there's JPA support as well, if you prefer that. JTA integration is also being worked on.</p>
<p><strong>Read some articles</strong><br />
There are loads of (both academic and slightly less academic) articles about Scala. For example, if you want to know the rationale behind Scala's actor library design, read the Scala <a href="http://lamp.epfl.ch/~phaller/actors.html">actor papers</a>.<br />
Martin Odersky's homepage also contains a long list of <a href="http://lampwww.epfl.ch/~odersky">his publications</a>.<br />
They offer plenty of insight into the design of Scala and its libraries. If you're not that academically inclined and come from a Java background, try <a href="http://www.ibm.com/developerworks/java/library/j-scala01228.html">The busy Java developer's guide to Scala</a>, by Ted Neward. It provides a nice introduction.</p>
<p><strong>Read some blogs.</strong><br />
Plenty of blogs are available; here are some I like:</p>
<ul>
<li><a href="http://www.codecommit.com/blog/scala/">Daniel Spiewak</a> Lots of useful Scala examples.</li>
<li><a href="http://dibblego.wordpress.com">Tony Morris</a>. He's the creator of <a href="http://functionaljava.org/">Functional Java</a> and of <a href="http://code.google.com/p/scalaz/">Scalaz</a>. Beware, however: if you check this out you might end up in deep functional waters where Monads reign.</li>
<li><a href="http://jonasboner.com">Jonas Bonér</a> He has lots of experience using Scala in the real world, some of which he has blogged about extensively. Well worth the read.</li>
<li><a href="http://www.planetscala.com">Planet Scala</a> Perhaps you don't even need more than this. It aggregates numerous blog posts about Scala (including the ones mentioned above).</li>
</ul>
<p><strong>Join IRC</strong><br />
Scala has an IRC channel, which can be found <a href="irc://irc.freenode.net/scala">here</a>. Whenever you're stuck on a problem, there are always the mailing lists, but many knowledgeable people hang out on the IRC channel on a daily basis. Use <a href="http://paste.pocoo.org/">pocoo</a> to show the code you're stuck with, and you'll get an answer in no time. </p>
<p><strong>Start coding.</strong><br />
Lift might be the choice if you really want to write some useful, practical real world web applications. If you're totally impractical like me, pick a few problems from <a href="http://projecteuler.net/">Project Euler</a>, or the 99 problems project (solutions in Scala can be found <a href="http://aperiodic.net/phil/scala/s-99">here</a>). Many excellent programmers have taken this path; it's fun and an excellent way to play around with the core Scala API. </p>
<p>To put into practice what I preach here, here are a few lines of Scala code to either whet your appetite or make you run away screaming. Note that my Scala knowledge is still in the pre-kindergarten stage and I'm barely able to speak, so laugh at will when viewing this code. </p>
<p>First, a randomly picked Euler problem, <a href="http://projecteuler.net/index.php?section=problems&id=48">number 48</a>, just because I like one-liners:</p>
<pre class="brush: scala;">
(1 to 1000).map(x =&gt; new java.math.BigDecimal(x).pow(x)).reduceLeft((a,b) =&gt; a.add(b)).remainder(new java.math.BigDecimal(10).pow(10))
</pre>
<p>This one-liner is not nearly as concise and neat as the Haskell version, but it will do. If nothing else, it at least has the tiny merit of showing Scala - Java interoperability. </p>
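<p>For comparison, roughly the same computation can be written in plain Java with <code>BigInteger</code>. This is only a sketch with hypothetical class and method names; unlike the Scala one-liner, it reduces each power modulo 10^10 with <code>modPow</code>, so the intermediate values stay small.</p>

```java
import java.math.BigInteger;

// Hypothetical helper for Project Euler problem 48.
final class Euler48Sketch {

    // Returns (1^1 + 2^2 + ... + limit^limit) mod 10^10, i.e. the last ten digits of the sum.
    static BigInteger lastTenDigitsOfSelfPowers(int limit) {
        BigInteger modulus = BigInteger.TEN.pow(10);
        BigInteger sum = BigInteger.ZERO;
        for (int i = 1; i <= limit; i++) {
            BigInteger base = BigInteger.valueOf(i);
            // modPow computes i^i mod 10^10 without building the full power.
            sum = sum.add(base.modPow(base, modulus));
        }
        return sum.mod(modulus);
    }

    public static void main(String[] args) {
        System.out.println(lastTenDigitsOfSelfPowers(1000));
    }
}
```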
<p>Another randomly picked and slightly more complex one, problem <a href="http://projecteuler.net/index.php?section=problems&id=21">number 21</a>:</p>
<pre class="brush: scala;">
object Euler21 {
  def sumOfDivisors(number: Int): Int = {
    List.range(1, number).filter{i =&gt; (number % i) == 0}.foldLeft(0){(a,b) =&gt; a+b}
  }

  def solver(): Int  = {
      (for (i &lt;- 0 until 10000; di = sumOfDivisors(i); if(di &gt; i &amp;&amp; sumOfDivisors(di)  == i) ) yield (i+di)).foldLeft(0){(a,b) =&gt; a+b};
  }
}
</pre>
<p>Enjoy.</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/07/03/starting-out-with-scala/feed/</wfw:commentRss>
		</item>
		<item>
		<title>ShuntingYard algorithm in Scala</title>
		<link>http://blog.xebia.com/2009/07/02/shuntingyard-algorithm-in-scala/</link>
		<comments>http://blog.xebia.com/2009/07/02/shuntingyard-algorithm-in-scala/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 10:21:16 +0000</pubDate>
		<dc:creator>Jeroen van Erp</dc:creator>
		
		<category><![CDATA[Functional Programming]]></category>

		<category><![CDATA[Scala]]></category>

		<category><![CDATA[kata]]></category>

		<category><![CDATA[shunting]]></category>

		<category><![CDATA[yard]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2331</guid>
		<description><![CDATA[<p>Last week I came across an interesting "coding kata" by Brett Schuchert on the <a href="http://blog.objectmentor.com/articles/2009/06/24/shunting-yard-algorithm-kata">Object Mentor blog</a>. The trick of a kata is that you grow the program step-by-step using tests, just like a <a href="http://en.wikipedia.org/wiki/Kata">kata</a> in karate is taught to a student. The problem of this kata was the <a href="http://en.wikipedia.org/wiki/Shunting_yard_algorithm">Shunting Yard algorithm</a> of <a href="http://en.wikipedia.org/wiki/Edsger_Dijkstra">Dijkstra</a>. I wanted to see if I could implement this kata in Scala.<br />
<span id="more-2331"></span><br />
Instead of writing the algorithm as described, the trick in a coding kata is that you first write a test and then make it pass. You then keep adding tests one at a time, keeping everything green. Brett described the test cases in his blog entry.</p>
<p>The Shunting Yard algorithm is used to convert a mathematical expression in infix notation to reverse Polish (postfix) notation. For instance, it can convert an expression like <em>3+4</em> to <em>3 4 +</em>.</p>
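<p>For reference, the classic stack-based formulation of the algorithm can be sketched in a few lines of Java. This is a minimal illustration and not part of the kata: the class and method names are hypothetical, it handles only the binary operators + - * / ^ and parentheses, and it does not support function calls.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, minimal stack-based infix-to-RPN converter.
final class ShuntingYardSketch {

    // Precedence of the binary operators; higher binds tighter, -1 means "not an operator".
    static int precedence(char op) {
        switch (op) {
            case '^': return 3;
            case '*': case '/': return 2;
            case '+': case '-': return 1;
            default: return -1;
        }
    }

    static boolean isRightAssociative(char op) {
        return op == '^';
    }

    static String toRpn(String infix) {
        List<String> output = new ArrayList<>();
        Deque<Character> operators = new ArrayDeque<>();
        int i = 0;
        while (i < infix.length()) {
            char c = infix.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c)) {
                // Numbers and variables go straight to the output.
                int start = i;
                while (i < infix.length() && Character.isLetterOrDigit(infix.charAt(i))) {
                    i++;
                }
                output.add(infix.substring(start, i));
            } else if (c == '(') {
                operators.push(c);
                i++;
            } else if (c == ')') {
                // Pop operators until the matching opening parenthesis.
                while (!operators.isEmpty() && operators.peek() != '(') {
                    output.add(String.valueOf(operators.pop()));
                }
                if (operators.isEmpty()) {
                    throw new IllegalArgumentException("Unbalanced ')'");
                }
                operators.pop(); // discard the '('
                i++;
            } else if (precedence(c) > 0) {
                // Pop operators that bind tighter, or equally tight when c is left-associative.
                while (!operators.isEmpty()
                        && (precedence(operators.peek()) > precedence(c)
                            || (precedence(operators.peek()) == precedence(c)
                                && !isRightAssociative(c)))) {
                    output.add(String.valueOf(operators.pop()));
                }
                operators.push(c);
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        while (!operators.isEmpty()) {
            char op = operators.pop();
            if (op == '(') {
                throw new IllegalArgumentException("Unbalanced '('");
            }
            output.add(String.valueOf(op));
        }
        return String.join(" ", output);
    }
}
```

<p>With this sketch, <code>ShuntingYardSketch.toRpn("3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3")</code> yields <code>3 4 2 * 1 5 - 2 3 ^ ^ / +</code>.</p>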
<p>In order to implement this algorithm, one needs to parse the infix string and break it up into tokens. Scala has <a href="http://www.codecommit.com/blog/scala/the-magic-behind-parser-combinators">parser combinators</a> that can do just this. Using a parser, I build an Abstract Syntax Tree (AST), which is a representation of the formula in tree form. The ASTs for 3 + 4 and (3 + 4) * 4 look like this:<br />
<img src="http://blog.xebia.com/wp-content/uploads/2009/07/ast.png" alt="ast" title="ast" width="374" height="207" class="aligncenter size-full wp-image-2336" /></p>
<p>The AST representation of the formula can then easily be printed in reverse Polish notation, as the code shows. The final parser looks like this:</p>
<pre class="brush: scala;">
import scala.util.parsing.combinator.syntactical._

abstract class Expr {
  def rpn:String
}
case class BinaryOperator(lhs:Expr, op:String, rhs:Expr) extends Expr {
	def rpn:String = lhs.rpn + &quot; &quot; + rhs.rpn + &quot; &quot; + op
}
case class Number(v:String) extends Expr { def rpn:String = v }
case class Variable(v:String) extends Expr { def rpn:String = v }
case class Function(f:String, e:List[Expr]) extends Expr { def rpn:String = {
	var s = &quot;&quot;
	e.foreach { x =&gt; s += x.rpn + &quot; &quot; }
	s += f
	return s
  }
}
object ShuntingYard extends StandardTokenParsers {
    lexical.delimiters ++= List(&quot;+&quot;,&quot;-&quot;,&quot;*&quot;,&quot;/&quot;, &quot;^&quot;,&quot;(&quot;,&quot;)&quot;,&quot;,&quot;)

    def value :Parser[Expr] = numericLit ^^ { s =&gt; Number(s) }
    def variable:Parser[Expr] =  ident ^^ { s =&gt; Variable(s) }
    def parens:Parser[Expr] = &quot;(&quot; ~&gt; expr &lt;~ &quot;)&quot;

    def argument:Parser[Expr] = expr &lt;~ (&quot;,&quot;?)
    def func:Parser[Expr] = ( ident ~ &quot;(&quot; ~ (argument+) ~ &quot;)&quot; ^^ { case f ~ _ ~ e ~ _ =&gt; Function(f, e) })

    def term = (value | parens | func | variable)

    // Needed to define recursive because ^ is right-associative
    def pow :Parser[Expr] = ( term ~ &quot;^&quot; ~ pow ^^ {case left ~ _ ~ right =&gt; BinaryOperator(left, &quot;^&quot;, right) }|
    			term)
    def factor = pow * (&quot;*&quot; ^^^ { (left:Expr, right:Expr) =&gt; BinaryOperator(left, &quot;*&quot;, right) } |
                        &quot;/&quot; ^^^ { (left:Expr, right:Expr) =&gt; BinaryOperator(left, &quot;/&quot;, right) } )
    def sum =  factor * (&quot;+&quot; ^^^ { (left:Expr, right:Expr) =&gt; BinaryOperator(left, &quot;+&quot;, right) } |
    					&quot;-&quot; ^^^ { (left:Expr, right:Expr) =&gt; BinaryOperator(left, &quot;-&quot;, right) } )
    def expr = ( sum | term )

    def parse(s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(expr)(tokens)
    }

    def shunt(exprstr: String) : String = exprstr match {
      case null =&gt; return &quot;&quot;
      case &quot;&quot; =&gt; return &quot;&quot;
      case _ =&gt;
    	parse(exprstr) match {
            case Success(tree, _) =&gt;
                println(&quot;Tree: &quot;+tree)
                val v = tree.rpn
                println(&quot;RPN: &quot;+v)
                return v
            case e: NoSuccess =&gt; Console.err.println(e)
            	return e.toString
        }
    }
}
</pre>
<p>Of course I tested this with a unit test, of which I will show only a small part; you can easily complete it by looking at Brett's examples:</p>
<pre class="brush: scala;">
import org.junit.Test
import org.junit.Assert._

class ShuntingYardTest {
  @Test
  def test_11() {
    assertEquals(&quot;4 g f&quot;, ShuntingYard.shunt(&quot;f(g(4))&quot;))
  }

  @Test
  def test_12() {
    assertEquals(&quot;3 4 19 f&quot;, ShuntingYard.shunt(&quot;f(3, 4, 19)&quot;))
  }

  @Test
  def test_13() {
    assertEquals(&quot;3 4 2 * 1 5 - 2 3 ^ ^ / +&quot;, ShuntingYard.shunt(&quot;3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3&quot;))
  }

  @Test
  def test_14 () {
    assertEquals(&quot;4 5 + 1 a 2 ^ + 8 b + 10 * f&quot;, ShuntingYard.shunt(&quot;f(4+5,1+a^2,(8+b)*10)&quot;))
  }
}
</pre>
<p>Of all the tests that Brett defined, only one doesn't pass. Can you spot which one? Of course, as this is one of my first exercises in Scala, it probably isn't optimal yet; any improvements are welcome in the comments!</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/07/02/shuntingyard-algorithm-in-scala/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Thinking MapReduce with Hadoop</title>
		<link>http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-hadoop/</link>
		<comments>http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-hadoop/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 07:05:20 +0000</pubDate>
		<dc:creator>Maarten Winkels</dc:creator>
		
		<category><![CDATA[hadoop]]></category>

		<category><![CDATA[mapreduce]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2317</guid>
		<description><![CDATA[<p><a href="http://hadoop.apache.org/">Apache Hadoop</a> promises "a software platform that lets one easily write and run applications that process vast amounts of data". Sure enough, when reading the documentation, descriptions like:</p>
<pre>
(input) &lt;k1, v1> -> map -> &lt;k2, v2> -> combine -> &lt;k2, v2> -> reduce -> &lt;k3, v3> (output)
</pre>
<p>are simple enough to read and understand, but how do you apply MapReduce to a problem you face in a real-life project?</p>
<p>This blog tries to give some insight into how to apply MapReduce with Hadoop.</p>
<p><span id="more-2317"></span></p>
<h2>What is Hadoop?</h2>
<p>Hadoop is basically two things:</p>
<ol>
<li>A distributed file system -- HDFS</li>
<li>A MapReduce framework that allows algorithms to work on data in the distributed file system in parallel</li>
</ol>
<p>The <a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html">Hadoop Distributed File System (HDFS)</a> is really the heart of Hadoop. It provides scalability, reliability and performance at a low cost. The system is designed to run on commodity hardware. Although the system is written in Java, there are other ways to access and use it.</p>
<p>MapReduce is a software framework that allows computation to run on a cluster. It uses HDFS for data proximity: the computation will be distributed and run in parallel on the cluster, and each process will access and process data that is available locally on its node. This gives a major performance boost. Furthermore, the framework provides reliability, because processes that fail will automatically be restarted on other nodes.</p>
<h2>When to apply MapReduce?</h2>
<p>MapReduce is only useful for systems that process large amounts of data. There is an overhead for starting tasks and there is always the network overhead. When talking about large amounts of data, we mean GBs and above rather than MBs.</p>
<p>Another important aspect is the data usage pattern in your application. If you need to 'randomly' read and write data, for example based on a certain request coming in, MapReduce cannot help you. MapReduce really shines when data is read in a batch-like streaming manner.</p>
<p>The final requirement for using MapReduce is that the algorithm can be described as a map-and-reduce process. In this blog I want to focus on this last aspect.</p>
<h2>How to apply MapReduce? - Another example</h2>
<p>So how do you describe an algorithm in a MapReduce manner? To illustrate, nothing works better than an example. As with every example that I have seen for Hadoop, it is a bit academic. What I'm trying to explain is how a MapReduce algorithm is different from a normal approach and how to go about designing that algorithm.</p>
<p>The main thing with a MapReduce algorithm is that it reasons about &lt; key , value > pairs all along, from the input format to the output format, if necessary using synthetic keys. If your input is a simple flat file, it will by default break it up on line ends and provide the offset into the file as key and the line as value. The main strength of the algorithm lies with the fact that between the map and the reduce phase, it will sort the data by key. The framework will then provide all data with the same key to the same reducer instance. Any successful MapReduce algorithm should leverage this mechanism.</p>
<p><b>The problem - Finding Anagrams</b><br />
Say you have to find <a href="http://en.wikipedia.org/wiki/Anagram">anagrams</a> in a very large input file. How would you implement this?</p>
<p>I think a first attempt would have some sort of function like this:</p>
<pre class="java">
  public static boolean isAnagram(String first, String second) {
    // Checks that the two inputs are anagrams, by checking they have all the same characters.
    // Left as exercise for the user...
  }
</pre>
<p>The application would have to somehow execute this function on all pairs of words in the input. However fast this method would be, the overall execution would still take quite some time.</p>
<p><b>Hadoopifying...</b><br />
How do you now design a MapReduce algorithm that will give the desired answer? The key lies in finding a function that will produce the same key for all words that are anagrams. Applying this in the map phase will use the power of the MapReduce framework to deliver all words that are anagrams to the same reducer. The solution, when found, is deceptively simple as usual:</p>
<pre class="java">
  public static String sortCharacters(String input) {
    char[] cs = input.toCharArray();
    Arrays.sort(cs);
    return new String(cs);
  }
</pre>
<p>By sorting all the characters in all the words in the input, all anagrams will have the same key:</p>
<pre>
  aspired -> adeiprs
  despair -> adeiprs
</pre>
<p>Now the list of characters to the right has no meaning, but all anagrams will have exactly the same result for this function.</p>
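<p>To see the grouping idea in isolation, here is a minimal sketch in plain Java that imitates the map, sort and reduce steps in memory, without Hadoop. The class name and the <code>groupByKey</code> and <code>groupAnagrams</code> helpers are hypothetical, and a <code>TreeMap</code> merely stands in for the framework's sort-by-key step.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory imitation of the MapReduce anagram grouping.
final class AnagramGroupingSketch {

    // "Map" step: derive the key shared by all anagrams of a word.
    static String sortCharacters(String input) {
        char[] cs = input.toCharArray();
        Arrays.sort(cs);
        return new String(cs);
    }

    // "Sort" step: collect the distinct words under their key, ordered by key.
    static Map<String, Set<String>> groupByKey(List<String> words) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String word : words) {
            String token = word.toLowerCase();
            groups.computeIfAbsent(sortCharacters(token), k -> new TreeSet<>()).add(token);
        }
        return groups;
    }

    // "Reduce" step: keep only the keys that received more than one distinct word.
    static List<Set<String>> groupAnagrams(List<String> words) {
        List<Set<String>> anagrams = new ArrayList<>();
        for (Set<String> group : groupByKey(words).values()) {
            if (group.size() > 1) {
                anagrams.add(group);
            }
        }
        return anagrams;
    }

    public static void main(String[] args) {
        // Prints [[aspired, despair]]
        System.out.println(groupAnagrams(Arrays.asList("aspired", "despair", "hadoop")));
    }
}
```

<p>No word is ever compared with another word here: each word is looked at once, and the grouping by key does the rest.</p>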
<p><b>Implementation</b><br />
Once the algorithm is found, the implementation using Hadoop is quite straightforward (though pretty long...).</p>
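<pre class="brush: java;">
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AnagramFinder extends Configured implements Tool {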

  public static class Mapper extends org.apache.hadoop.mapreduce.Mapper&lt;LongWritable, Text, Text, Text&gt; {

    private Text sortedText = new Text();
    private Text outputValue = new Text();

    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString(),
          &quot; \t\n\r\f,.:()!?&quot;, false);
      while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken().trim().toLowerCase();
        sortedText.set(sort(token));
        outputValue.set(token);
        context.write(sortedText, outputValue);
      }
    }

    protected String sort(String input) {
      char[] cs = input.toCharArray();
      Arrays.sort(cs);
      return new String(cs);
    }

  }

  public static class Combiner extends org.apache.hadoop.mapreduce.Reducer&lt;Text, Text, Text, Text&gt; {

    protected void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {
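      // Hadoop reuses one Text instance while iterating, so track the String contents instead.
      Set&lt;String&gt; uniques = new HashSet&lt;String&gt;();
      for (Text value : values) {
        if (uniques.add(value.toString())) {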
          context.write(key, value);
        }
      }
    }
  }

  public static class Reducer extends org.apache.hadoop.mapreduce.Reducer&lt;Text, Text, IntWritable, Text&gt; {

    private IntWritable count = new IntWritable();
    private Text outputValue = new Text();

    protected void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException {
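      // Hadoop reuses one Text instance while iterating, so track the String contents instead.
      Set&lt;String&gt; uniques = new HashSet&lt;String&gt;();
      int size = 0;
      StringBuilder builder = new StringBuilder();
      for (Text value : values) {
        if (uniques.add(value.toString())) {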
          size++;
          builder.append(value.toString());
          builder.append(',');
        }
      }
      builder.setLength(builder.length() - 1);

      if (size &gt; 1) {
        count.set(size);
        outputValue.set(builder.toString());
        context.write(count, outputValue);
      }
    }

  }

  public int run(String[] args) throws Exception {
    Path inputPath = new Path(args[0]);
    Path outputPath = new Path(args[1]);

    Job job = new Job(getConf(), &quot;Anagram Finder&quot;);

    job.setJarByClass(AnagramFinder.class);

    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setMapperClass(Mapper.class);
    job.setCombinerClass(Combiner.class);
    job.setReducerClass(Reducer.class);

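    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // The Reducer emits (count, words), so declare the final output types as well.
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);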

    return job.waitForCompletion(false) ? 0 : -1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new AnagramFinder(), args));
  }
}
</pre>
<p>The main parts of the implementation are the following:</p>
<p><b>Mapper</b> - Breaks up the input text into tokens (filtering out some common punctuation marks) and applies the character sorting to arrive at the required key.<br />
<b>Combiner</b> (optional) - Removes duplicate values from the input.<br />
<b>Reducer</b> - Collects anagrams and outputs the number of anagrams (key) and all the words concatenated (value).<br />
<b>Main and Run</b> - This code configures the job to run on the MapReduce framework.</p>
<p>The Combiner does some preprocessing for the Reducer. The main reason for this is that results from Mappers that run on different nodes have to be processed on the same node and thus have to travel across the network. To minimize network load, the Combiner can reduce the number of &lt;key, value&gt; pairs that need to be processed, as shown here by filtering out duplicates. However, the process cannot rely on all &lt;key, value&gt; pairs for a key being processed by a single Combiner, so the Reducer also has to remove duplicates.</p>
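<p>Why the Reducer still has to remove duplicates can be shown with a small plain-Java simulation. This is only a sketch under the assumption that each mapper node runs its own combine step; CombinerSketch and its methods are made-up names and do not use the Hadoop API.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the combine and reduce steps for ONE key.
class CombinerSketch {

  // Stand-in for the Combiner: removes duplicates within a single node's map output.
  static List<String> combine(List<String> valuesFromOneNode) {
    return new ArrayList<String>(new LinkedHashSet<String>(valuesFromOneNode));
  }

  // Stand-in for the Reducer: sees the output of several combiners,
  // so the same word can still arrive more than once.
  static List<String> reduce(List<List<String>> combinerOutputs) {
    Set<String> uniques = new LinkedHashSet<String>();
    for (List<String> output : combinerOutputs) {
      uniques.addAll(output);
    }
    return new ArrayList<String>(uniques);
  }
}
```

<p>If one node produced "aspired", "aspired", "despair" and another produced "despair", "despair", each combine call shrinks its own list, but "despair" still reaches the reduce step twice, once from each node.</p>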
<p>It is interesting to see that the concept Anagram doesn't materialize anywhere in this code. The fact that the code finds anagrams follows from the fact that all anagrams will have the same value from the sort function and that is used as the map output key. This might be quite confusing for readers.</p>
<h2>Conclusion</h2>
<p>The main challenge posed by Hadoop is coming up with a good algorithm for MapReduce applications. The algorithm will mostly be the result of the whole MapReduce process and might not be easy to understand from the code. This is because some of the functionality that the framework provides might be key to the algorithm. Good documentation that describes the whole process is vital to overcome this problem. Once an algorithm is designed, implementing it in Hadoop is quite straightforward.</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-hadoop/feed/</wfw:commentRss>
		</item>
		<item>
		<title>J(2)ee, the basics and beyond</title>
		<link>http://blog.xebia.com/2009/06/30/j2ee-the-basics-and-beyond-starting-threads/</link>
		<comments>http://blog.xebia.com/2009/06/30/j2ee-the-basics-and-beyond-starting-threads/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 19:33:41 +0000</pubDate>
		<dc:creator>Sander Hautvast</dc:creator>
		
		<category><![CDATA[Concurrency Control]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Middleware]]></category>

		<category><![CDATA[websphere]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2305</guid>
		<description><![CDATA[In this series I want to address some topics that are old and well known, but still seem to puzzle developers and administrators in a J2EE environment. Think of anything in or around an application server. When talking about application servers, I mostly refer to WebSphere; sadly, I have no real experience with any other. Still, I aim to keep a broad perspective so as not to narrow the audience. The level should be beginner to intermediate.]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/06/30/j2ee-the-basics-and-beyond-starting-threads/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Jeff Sutherland @ nlscrum</title>
		<link>http://blog.xebia.com/2009/06/29/jeff-sutherland-nlscrum/</link>
		<comments>http://blog.xebia.com/2009/06/29/jeff-sutherland-nlscrum/#comments</comments>
		<pubDate>Mon, 29 Jun 2009 20:51:08 +0000</pubDate>
		<dc:creator>Marco Mulder</dc:creator>
		
		<category><![CDATA[Agile]]></category>

		<category><![CDATA[Scrum]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2267</guid>
		<description><![CDATA[<p>Last week I co-organized an <a href="http://www.nlscrum.org/">nlscrum</a> event with a very special guest: Jeff Sutherland. After rushing with him from the airport to our Xebia office, Jeff gave a very inspiring <a href="http://jeffsutherland.com/scrum/AgileArchitectureRedPillBluePillv3.pdf">presentation</a>.</p>
<p><span id="more-2267"></span></p>
<p>Jeff talked about the dramatic difference between the teams that take the red pill (the one Morpheus offered Neo in The Matrix) and those that take the blue pill, as most people do. After his presentation, many attendees felt like they had just taken the red pill. I hope that this inspiration lasted long enough to have some effect at work, where countless small and big obstacles cause many people's pill to be blue.</p>
<p>After dinner, which was accompanied by a nice Italian ice cream booth, we had a question-and-answer and discussion session in which Jeff touched on many Scrum-related topics. Usually at nlscrum events, we host an <a href="http://www.openspaceworld.org/cgi/wiki.cgi?AboutOpenSpace">OpenSpace</a> to give members of the nlscrum community a chance to share experiences. This time, we opted for a different format to give as many attendees as possible the opportunity to learn from and get inspired by Jeff.</p>
<p>The topic that triggered the most debate was Kanban versus Scrum. According to Jeff, Lean principles and techniques are important for doing Agile software development projects successfully. In fact, last month Jeff gave a Deep Lean course with Henrik Kniberg and the Poppendiecks. For those interested, you can find the course material <a href="http://www.crisp.se/deeplean/material.html">online</a>, including a presentation about Kanban versus Scrum by Henrik Kniberg.</p>
<p>All in all, I'm very happy with the way this special nlscrum event turned out. We had approximately 65 attendees and got a lot of positive feedback. I hope to see many of them again at our OpenSpaces. So, for all of you in the Netherlands who use Scrum or plan to do so, come to our <a href="http://www.nlscrum.org/">nlscrum</a> events to get inspired and learn from your peers!</p>
<p><img src="http://farm4.static.flickr.com/3570/3658690410_91af61dabb.jpg?v=0" alt="" /></p>
<p><img src="http://farm3.static.flickr.com/2458/3658778758_233235eb1e.jpg?v=0" alt="" /><br />
<em>Photos: Laurens Bonnema</em></p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/06/29/jeff-sutherland-nlscrum/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Web performance in seven steps; step 3: test representatively</title>
		<link>http://blog.xebia.com/2009/06/29/web-performance-in-seven-steps-step-3-test-representatively/</link>
		<comments>http://blog.xebia.com/2009/06/29/web-performance-in-seven-steps-step-3-test-representatively/#comments</comments>
		<pubDate>Mon, 29 Jun 2009 20:34:44 +0000</pubDate>
		<dc:creator>Jeroen Borgers</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<category><![CDATA[Performance]]></category>

		<category><![CDATA[Quality Assurance]]></category>

		<category><![CDATA[Testing]]></category>

		<category><![CDATA[Tools]]></category>

		<category><![CDATA[JMeter]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2269</guid>
		<description><![CDATA[<p>Last time I <a href="http://blog.xebia.com/2009/06/15/web-performance-in-seven-steps-step-2-execute-a-proof-of-concept/">blogged </a>about the importance of benchmarking the architecture and new technology in a Proof of Concept for Performance. This time I’ll deal with the importance of representative performance testing. </p>
<p>Slowness of applications in development environments is often neglected with the rationale that faster hardware in the production environment will solve this problem. However, whether this is really true can only be predicted with a test on a representative environment and in a representative way. In such an environment, more than just the hardware needs to be representative.<br />
<span id="more-2269"></span><br />
I have experienced multiple times that a database query on the test database with 1,000 customers took less than 10 ms, while on the production database with 100,000 customers it turned out to take tens of seconds because of missing indexes. So, if the development team does not test with a full-size database, going to production may lead to some surprises. </p>
<p>It is also important that the number of concurrent users and their behavior are simulated well in the test. Furthermore, caching effects should be taken into account: if the test continuously requests the same product for the same customer, this data will be in the database or query cache from the second request onwards. This speeds up the request considerably, making it much faster than it would be with many customers and products. Such a test is therefore not representative of the real situation. </p>
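<p>One way to avoid hitting only cached data is to spread the requests over many customers and products. The following is a minimal sketch of a test data generator; the class RequestMixSketch, the id ranges and the "customerId,productId" line format are all assumptions for illustration. Lines like these could, for example, be written to a CSV file and fed to a data-driven test.</p>

```java
import java.util.Random;

// Hypothetical test data generator; names and format are illustrative only.
class RequestMixSketch {
  private final Random random;
  private final int customerCount;
  private final int productCount;

  // A fixed seed makes a test run repeatable.
  RequestMixSketch(long seed, int customerCount, int productCount) {
    this.random = new Random(seed);
    this.customerCount = customerCount;
    this.productCount = productCount;
  }

  // Returns one line "customerId,productId" with ids drawn uniformly from 1..count.
  String nextLine() {
    int customerId = 1 + random.nextInt(customerCount);
    int productId = 1 + random.nextInt(productCount);
    return customerId + "," + productId;
  }
}
```

<p>A uniform spread is itself an assumption; if production traffic concentrates on a few popular products, the distribution should be skewed accordingly to stay representative.</p>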
<p>A suitable performance test tool and performance expertise are necessary to create a valuable test. The most popular open source performance test tool is Apache JMeter, see the next figure.</p>
<p><img src="http://blog.xebia.com/wp-content/uploads/2009/06/web-shop-case-study-jmeter.jpg" alt="Run of a performance test in JMeter." title="web-shop-case-study-jmeter" width="600" height="356" class="size-full wp-image-2268" /><br />
Figure: Screenshot of a run of a performance test in Apache JMeter.</p>
<p>This is a tool made by programmers, for programmers. Test scripts can be created with visual elements like an HTTP request, which can be recorded and configured. Many elements are available, and if you need more, you can always fall back on a BeanShell element in which you can manipulate the request, the response and various JMeter variables. If even that does not meet your needs, you can extend the JMeter source code and develop your own elements. Because of its for-programmers nature, it is less suited for the average tester. Its reporting features and the maintainability of its scripts are also not great. Therefore, commercial tools like HP Mercury LoadRunner, Borland SilkPerformer or Neotys’ NeoLoad may be good alternatives for companies.</p>
<p>Next time I’ll blog about step 4: continuous performance testing.</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/06/29/web-performance-in-seven-steps-step-3-test-representatively/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Open Letter to Geertjan Wielenga</title>
		<link>http://blog.xebia.com/2009/06/26/open-letter-to-geertjan-wielenga/</link>
		<comments>http://blog.xebia.com/2009/06/26/open-letter-to-geertjan-wielenga/#comments</comments>
		<pubDate>Fri, 26 Jun 2009 09:40:21 +0000</pubDate>
		<dc:creator>Wilfred Springer</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<category><![CDATA[fluent interface]]></category>

		<category><![CDATA[netbeans]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2225</guid>
		<description><![CDATA[<p><a href="http://blogs.sun.com/geertjan/">Geertjan Wielenga</a> has been trying to pull me back into the NetBeans community for <a href="http://blogs.sun.com/geertjan/entry/javadoc_code_completion">a couple of years in a row now</a>. I admire his perseverance; if this is typical for the whole NetBeans team, then Eclipse is going out of the window some day soon.</p>
<p><span id="more-2225"></span></p>
<p>There is <em>one</em> thing - really, just one thing - that would make me drop Eclipse <em>immediately</em> in favor of NetBeans. That's having better support for fluent interfaces in the way the IDE formats source code.</p>
<p>Now, I've been working on a couple of fluent interfaces over the last couple of years, and it's just awesome. It results in code that is easier to read, and it doesn't cost you a dime; you get the benefits of Java 5 type safety, without sacrificing readability.</p>
<p>Let's take this <a href="http://pecia.flotsam.nl/">Pecia</a> example:</p>
<pre class="brush: java;">
doc
    .section(&quot;Introduction&quot;)
        .para()
            .text(&quot;This is a document. Make sure you also check out the &quot;)
            .emphasis(&quot;next&quot;).text(&quot; section.&quot;)
        .end()
    .end()
    .section(&quot;Conclusion&quot;)
        .para(&quot;That's all folks.&quot;)
    .end()
.end();
</pre>
<p>Now, IMHO, this is pretty easy to read and understand. The layout of your code clearly reflects the structure of the underlying document model. However, if you press Command-Shift-F to format your code, this is what you get:</p>
<pre class="brush: java;">
doc.section(&quot;Introduction&quot;).para().text(&quot;This is a document. Make sure you also check out the &quot;).emphasis(&quot;next&quot;).text(&quot; section.&quot;).end().end().section(&quot;Conclusion&quot;).para(&quot;That's all folks.&quot;).end().end();
</pre>
<p>Not quite as good as what we had before.</p>
<p>Now, I would <em>love</em> to have a solution that basically prevented this. I was thinking about this for a while, and I could imagine introducing a couple of annotations for it. Annotations for fluent interfaces. Annotations that basically tell your IDE how to treat different components in your fluent interfaces for formatting.</p>
<p>Maybe annotations on the type of object produced by the section() operation, informing the compiler to treat this object as a "code block", in terms of the indentation. And perhaps an annotation on the .end() method telling the IDE to consider it the end of the "code block".</p>
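<p>To make the idea a bit more tangible, here is a rough sketch of what such annotations could look like. Everything in it is hypothetical: the annotation names FluentBlock and FluentBlockEnd and the FluentDoc and FluentSection interfaces are invented for this example, and no IDE or formatter currently interprets them.</p>

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical: calls chained on a value of this type are indented one level deeper.
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.TYPE)
@interface FluentBlock {
}

// Hypothetical: this method closes the block, so the formatter outdents again.
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.METHOD)
@interface FluentBlockEnd {
}

// Hypothetical document API, loosely modelled on the example above.
interface FluentDoc {
  FluentSection section(String title);
}

@FluentBlock
interface FluentSection {
  FluentSection para(String text);

  @FluentBlockEnd
  FluentDoc end();
}
```

<p>Source retention is assumed to be enough here, since a formatter works on source code rather than on compiled classes.</p>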
<p><img src="http://blog.xebia.com/wp-content/uploads/2009/06/test1.png" alt="test1" title="test1" width="586" height="380" class="alignnone size-full wp-image-2239" /></p>
<p>So, this is to you Geertjan: I solemnly swear to erase Eclipse from my hard disk, as soon as something like this gets implemented in NetBeans. Maybe an annotation based approach is what is needed, maybe it isn't; I don't really care, as long as something gets done about this.</p>
<p>(If anyone else has some thoughts on this, I'd be happy to hear about it.)</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/06/26/open-letter-to-geertjan-wielenga/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Pecia: Towards a Fluent Interface for Building Documents</title>
		<link>http://blog.xebia.com/2009/06/25/fluent-interface-for-documentation/</link>
		<comments>http://blog.xebia.com/2009/06/25/fluent-interface-for-documentation/#comments</comments>
		<pubDate>Thu, 25 Jun 2009 07:03:03 +0000</pubDate>
		<dc:creator>Wilfred Springer</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.xebia.com/?p=2202</guid>
		<description><![CDATA[<h3 class="title">1. Introduction</h3>
<p>There is a chance that - after having read this article - you conclude that nobody in a sane state of mind would ever use what this article is going to suggest. Let me therefore start with a disclaimer: I have never made any public claims regarding my state of mind.</p>
<p>Apart from that, I figure an article about a technology almost nobody is using is still <span class="emphasis"><em>way</em></span> more interesting than an article about one <span class="emphasis"><em>everyone</em></span> is using. In fact, I guess the more senior you are, and the more things you have already seen before, the less likely it is that you will be interested in something people already did many times before. Based on that, you might as well say that the most experienced people around will probably be interested in stuff that nobody is using. This article is for those people.</p>
<p>With that out of the way: Pecia is a new way of generating documentation from your Java applications. You will probably wonder why we need yet another way of generating documents from Java, and I have to admit that the Java world is not in bad shape when it comes to the number of frameworks allowing you to generate documents. However, Pecia takes another stab at it, and I just had to see if it would work. You be the judge whether it makes sense.</p>
<p><span id="more-2202"></span></p>
<h3 id="d4e20" class="title">2. Background</h3>
<p>I clearly remember the day on which I spent more time than I am willing to admit on finding out why my Maven report was not producing well-formed HTML. Definitely not one of my finest moments.</p>
<p>Maven reports are built using an API called Doxia. It's not a general purpose template-based text generating mechanism, like Velocity, FreeMarker or StringTemplate. In fact, there is no template at all. Instead, Doxia provides an API for building a document, abstracting the final representation of that document.</p>
<p>In a way, the most important interface in Doxia is the Sink interface. The <span class="interface">Sink</span> interface is basically the builder interface. It has operations to start the document body, to start a section, to generate text, to start a table, a table row, a table cell, and many other document chunks, and then a slew of operations to finish all of those. <a class="xref" title="Example 1. Doxia Sink usage" href="#example-doxia">Example 1, “Doxia Sink usage”</a> shows an excerpt of some of my code using Doxia.<br />
<a name="example-doxia"></a></p>
<p class="title"><strong>Example 1. Doxia Sink usage</strong></p>
<pre class="brush: java;">
sink.body();
sink.sectionTitle1();
sink.text(&quot;Message Catalog&quot;);
sink.sectionTitle1_();
sink.table();
sink.tableRow();
sink.tableHeaderCell();
sink.text(&quot;Type&quot;);
sink.tableHeaderCell_();
sink.tableHeaderCell();
sink.text(&quot;Identifier&quot;);
sink.tableHeaderCell_();
sink.tableHeaderCell();
sink.text(&quot;Message&quot;);
sink.tableHeaderCell_();
sink.tableRow_();
...
</pre>
<p><br class="example-break" />Unmistakably, there is a correspondence between the Sink interface and a subset of the HTML content model. However, HTML is not the only type of document that can be generated from Doxia. In fact, by abstracting the interface from the implementation, Doxia is capable of generating any type of output document. Because of that, it generates PDF just as easily as HTML.</p>
<p>Now, there is one limitation in the Doxia approach: you are never really sure if your code is building a <em>valid</em> document. The API does not prevent you from adding an image to a table row, or inserting text in a table row outside of a table cell. And since Doxia aims to abstract you from the target representation, it's quite hard to make any assumptions on what is or is not considered to be valid.</p>
<p>This was in fact the reason why I spent so much time completing my Maven report. It turned out I was 'building' the wrong type of document elements at the wrong time. The API did not prevent me from doing it, and the framework did not warn me at runtime. Which made me wonder....</p>
<h3 id="d4e34" class="title">3. Pecia</h3>
<p>It made me wonder if it would not be possible to enforce validation at compile time. Would it not be much nicer to have the <span class="emphasis"><em>API</em></span> prevent me from adding images to table rows, or - for instance - from adding a fourth table cell to a three-column table? Would it not be much nicer if the API would prevent me from making <span class="emphasis"><em>any</em></span> mistakes like these? And would it not be great if the API would be a <em>fluent</em> API<sup>[<a name="d4e40"></a>]</sup>?</p>
<p>From my perspective, the answer to all of these questions was: yes, that would be much nicer. It would allow me to spot errors quickly, and moreover, it would make my IDE's code completion assistance actually become valuable. With Doxia's Sink interface, your IDE will offer you the choice of adding images to a table row. In a framework that enforces proper document structure through its API, your IDE would never consider that to be a viable option.</p>
<h3 id="d4e44" class="title">4. Pecia API Principles</h3>
<h4 id="d4e46" class="title">4.1. Context-based</h4>
<p>The API offered by Pecia depends on your context. This is probably best explained with an example.<br />
<a name="example-api-usage"></a></p>
<p class="title"><strong>Example 2. Pecia API Usage</strong></p>
<pre class="programlisting">Article article = ...;
ItemizedList list = article.itemizedList();
list.item("first item");
list.item("second item");</pre>
<p><br class="example-break" />So what happens in the example above? Well, first you create the target document. More on that later. From that point on, you can add content to the Article. Once you add an itemized list, the API returns an ItemizedList instance, an object representation of that new context. If you want to add content to the itemized list, you need to invoke operations on the ItemizedList instance. In this case, the example adds two items to the itemized list.</p>
<p>Just like Doxia, Pecia is backed by a number of implementations of the API. At this stage, it supports both DocBook and HTML output. Given the sample code above, a simple HTML implementation would generate HTML output like this:</p>
<pre class="programlisting">&lt;html&gt;
  &lt;body&gt;
    &lt;ul&gt;
      &lt;li&gt;first item&lt;/li&gt;
      &lt;li&gt;second item&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/body&gt;
&lt;/html&gt;</pre>
<h4 id="d4e55" class="title">4.2. Method Chaining</h4>
<p>Now, <a class="link" title="Example 2. Pecia API Usage" href="#example-api-usage">the example given above</a> is <span class="emphasis"><em>not really</em></span> illustrative of the way you would write code with Pecia. With method chaining, you can create the document without declaring variables to hold every intermediate content model element, and create an interface that bears a greater similarity to the way you would normally create documents in markup languages such as HTML or DocBook. Let us just say, a more <em>fluent</em> interface. So instead of writing the code given in the previous section, you can write code like this:<br />
<a name="d4e61"></a></p>
<p class="title"><strong>Example 3. Method chaining</strong></p>
<pre class="programlisting">Article article = ...;
article.itemizedList()
  .item("first item")
  .item("second item");</pre>
<p><br class="example-break" /></p>
<h4 id="d4e64" class="title">4.3. Shorthand Notations</h4>
<p>This is another sample document:<br />
<a name="d4e67"></a></p>
<p class="title"><strong>Example 4. Mixed 'Expanded' and Shorthand Notations</strong></p>
<pre class="programlisting">article
  .author("Wilfred Springer")
  .copyright("agilejava.com", 2008)
  .para()
    .text("This is the ")
    .emphasis("first")
    .text(" paragraph.")
  .end()
  .para("And this is the second.")
.end()</pre>
<p><br class="example-break" />Which will generate something similar to this:</p>
<pre class="programlisting">&lt;html&gt;
  &lt;body&gt;
    &lt;p&gt;This is the &lt;em&gt;first&lt;/em&gt; paragraph.&lt;/p&gt;
    &lt;p&gt;And this is the second.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;</pre>
<p>The important principle illustrated here is that Pecia has both shorthand and more verbose notations for specifying content. The simple para(String) operation (illustrated by the second paragraph in the example) starts the paragraph, adds text to it and closes the paragraph. So, it basically expands to this:</p>
<pre class="programlisting">.para()
  .text("And this is the second.")
.end()</pre>
<p>The principle does not only apply to paragraphs. It also applies to other document elements, such as list items, table cells and footnotes. In all of these cases, you can add that document element using a simple operation accepting a String with the text to be embedded within that document element, <span class="emphasis"><em>or</em></span> by calling an operation <span class="emphasis"><em>without</em></span> any arguments, which will change the context into the context of that document element.</p>
<p>Let's take an API snippet as an example. <a class="xref" title="Example 5. Pecia API Snippet" href="#example-adding-footnotes">Example 5, “Pecia API Snippet”</a> shows the signature of some operations on Para, the interface implemented by paragraphs. As you can see, there are two different footnote operations. They are different in a number of ways.</p>
<p>First of all, the first one takes a String argument, and the second does not. The first operation will create a footnote, add a paragraph, and add text to the paragraph in a single call. Once it is done, the entire footnote is considered to be done. The context is no longer the footnote that was just added to the paragraph. The context is - again - the paragraph itself.<br />
<a name="example-adding-footnotes"></a></p>
<p class="title"><strong>Example 5. Pecia API Snippet</strong></p>
<pre class="programlisting">interface Para&lt;T&gt; {
  ...
  Para&lt;T&gt; footnote(String text);
  Footnote&lt;? extends Para&lt;T&gt;&gt; footnote();
  ...
}</pre>
<p><br class="example-break" />The other footnote operation does <span class="emphasis"><em>not</em></span> take a String argument. The API assumes you are not interested in adding an empty footnote (why would you?), and changes the current context into the context of the footnote. From that point on, you can only invoke operations defined by the Footnote interface, until you finally consider yourself to be done with the footnote and call its end() operation, which will restore the original context.</p>
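<p>How end() can restore the original context is easiest to see in a stripped-down sketch. The classes below are not Pecia's real implementation; ArticleSketch, ParaSketch, FootnoteSketch and the bracketed output format are made up, and the only assumption is that each element carries the type of its parent as a type parameter, as the Para&lt;T&gt; signature above suggests.</p>

```java
// Hypothetical sketch, not Pecia's actual classes.
class FootnoteSketch<P> {
  private final P parent;
  private final StringBuilder out;

  FootnoteSketch(P parent, StringBuilder out) {
    this.parent = parent;
    this.out = out;
    out.append("[footnote]");
  }

  FootnoteSketch<P> para(String text) {
    out.append(text);
    return this;
  }

  // Closes the footnote and hands back the enclosing context.
  P end() {
    out.append("[/footnote]");
    return parent;
  }
}

class ParaSketch<P> {
  private final P parent;
  private final StringBuilder out;

  ParaSketch(P parent, StringBuilder out) {
    this.parent = parent;
    this.out = out;
    out.append("[para]");
  }

  ParaSketch<P> text(String text) {
    out.append(text);
    return this;
  }

  // Shorthand: the footnote is complete after this call, so the context stays the paragraph.
  ParaSketch<P> footnote(String text) {
    return footnote().para(text).end();
  }

  // Expanded form: the context becomes the footnote until its end() is called.
  FootnoteSketch<ParaSketch<P>> footnote() {
    return new FootnoteSketch<ParaSketch<P>>(this, out);
  }

  P end() {
    out.append("[/para]");
    return parent;
  }
}

class ArticleSketch {
  private final StringBuilder out = new StringBuilder();

  ParaSketch<ArticleSketch> para() {
    return new ParaSketch<ArticleSketch>(this, out);
  }

  String render() {
    return out.toString();
  }
}
```

<p>With these classes, article.para().footnote().para("note").end() has the static type ParaSketch&lt;ArticleSketch&gt; again, so the compiler only offers paragraph operations from that point on.</p>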
<h4 id="d4e85" class="title">4.4. Tables</h4>
<p>Tables deserve some special attention. In order to preserve a valid document structure, you not only want to restrict table cells to table rows; you also need to make sure that every row contains <span class="emphasis"><em>exactly</em></span> the same number of cells.</p>
<p>Enforcing this property of tables proved to be challenging. Before going into details, let us first look at an example:<br />
<a name="example-table"></a></p>
<p class="title"><strong>Example 6. A table in Pecia</strong></p>
<pre class="programlisting">article.table2Cols()
  .header()
    .entry().para("col1")
    .entry().para("col2")
  .end()
  .row()
    .entry().para("foo")
    .entry().para("bar")
  .end()
  .row()
    .entry().para("foo")
    .entry().para("bar")
  .end()
.end();</pre>
<p><br class="example-break" /><a class="xref" title="Example 6. A table in Pecia" href="#example-table">Example 6, “A table in Pecia”</a> illustrates how to build a table that more or less corresponds to this HTML table:</p>
<pre class="programlisting">&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;&lt;p&gt;col1&lt;/p&gt;&lt;/th&gt;
    &lt;th&gt;&lt;p&gt;col2&lt;/p&gt;&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;p&gt;foo&lt;/p&gt;&lt;/td&gt;
    &lt;td&gt;&lt;p&gt;bar&lt;/p&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;p&gt;foo&lt;/p&gt;&lt;/td&gt;
    &lt;td&gt;&lt;p&gt;bar&lt;/p&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;</pre>
<p>So what exactly is happening in <a class="xref" title="Example 6. A table in Pecia" href="#example-table">Example 6, “A table in Pecia”</a>? First, the table2Cols() operation constructs a table with two columns. The resulting object allows only operations that are valid for a two-column table.</p>
<p>The first thing we do after that is add a table header by calling header() on the table. Since it is a two-column table, the header accepts exactly two cells. Any attempt to add more or fewer than two cells results in a compilation error.</p>
<p>Every table cell is constructed by calling entry(). The resulting context is a table cell. There are a number of things you can add to a table cell, such as paragraphs. Once you are done with the cell, you call either entry() or end(). Calling entry() creates the next table cell. Calling end() marks the end of the current table header. Because of the way Pecia has been constructed, you can call end() only after the last table cell, and entry() only before the last table cell.</p>
<p>Table rows are added in <span class="emphasis"><em>exactly</em></span> the same way as the table header; only in this case you call row() instead of header().</p>
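<p>The compile-time guarantee described above can be obtained by giving every step of a row its own type. The following is only a minimal, self-contained sketch of that idea, not Pecia's actual code: the names TwoColTable, Row, FirstCell and LastCell are made up for this illustration, and the sketch writes to a plain string instead of a real document.</p>
<pre class="programlisting">// Hypothetical sketch of the idea; TwoColTable and its nested types are not Pecia's.
final class TwoColTable {
  private final StringBuilder out = new StringBuilder();

  /** Starts a row. A fresh row offers entry() and nothing else. */
  Row row() {
    out.append("row(");
    return new Row();
  }

  /** Returns everything written so far. */
  String render() {
    return out.toString();
  }

  /** State: row opened, no cell yet. */
  final class Row {
    FirstCell entry() {
      out.append("cell(");
      return new FirstCell();
    }
  }

  /** State: inside the first cell. There is no end() here. */
  final class FirstCell {
    FirstCell para(String text) {
      out.append(text);
      return this;
    }

    LastCell entry() {
      out.append(")cell(");
      return new LastCell();
    }
  }

  /** State: inside the second and last cell. There is no entry() here. */
  final class LastCell {
    LastCell para(String text) {
      out.append(text);
      return this;
    }

    TwoColTable end() {
      out.append("))");
      return TwoColTable.this;
    }
  }

  static String demo() {
    // Neither .row().entry().para("foo").end() nor a third .entry() would
    // compile, because the state types simply lack those methods.
    return new TwoColTable()
        .row()
          .entry().para("foo")
          .entry().para("bar")
        .end()
        .render();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}</pre>
<p>Because each state is a separate type, an IDE's code completion would only ever offer the calls that are legal at that point in the row.</p>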
<h4 id="d4e102" class="title">4.5. Metadata</h4>
<p>Some document elements can have metadata associated with them, often data that is not necessarily part of the main document flow. In such cases, Pecia allows you to specify the metadata at the start of the document element to which it pertains.</p>
<p>Let us take an article as an example. An article can have an author. At the beginning of an article, before adding any content, you can add metadata such as the author's name. Once you have started adding content to the article, it is impossible to add any more metadata. <a class="xref" title="Example 7. Article metadata" href="#example-article-metadata">Example 7, “Article metadata”</a> shows a valid way of specifying metadata. <a class="xref" title="Example 8. Illegal article metadata" href="#example-invalid-article-metadata">Example 8, “Illegal article metadata”</a> illustrates an invalid way; the compiler will <span class="emphasis"><em>not</em></span> accept any more metadata <span class="emphasis"><em>after</em></span> content has been added to the article.<br />
<a name="example-article-metadata"></a></p>
<p class="title"><strong>Example 7. Article metadata</strong></p>
<pre class="programlisting">article
  .author()
    .firstname("Wilfred")
    .surname("Springer")
  .end()
  .para("This is the first paragraph.");</pre>
<p><br class="example-break" /><br />
<a name="example-invalid-article-metadata"></a></p>
<p class="title"><strong>Example 8. Illegal article metadata</strong></p>
<pre class="programlisting">article
  .para("This is the first paragraph.")
<span class="emphasis"><em>  .author()
    .firstname("Wilfred")
    .surname("Springer")
  .end()</em></span></pre>
<p><br class="example-break" /></p>
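<p>The ordering rule can again be expressed in the types: the object you start with offers both metadata and content operations, while the object returned by the first content operation no longer offers metadata. Below is a minimal sketch under stated assumptions: ArticleSketch, Head and Body are hypothetical names (not Pecia's types), and author() is simplified to take both names at once instead of opening a nested author context.</p>
<pre class="programlisting">// Hypothetical sketch; ArticleSketch, Head and Body are illustrative names only.
final class ArticleSketch {
  private final StringBuilder out = new StringBuilder();

  /** Every article starts in the Head state. */
  static Head start() {
    return new ArticleSketch().new Head();
  }

  /** State: no content yet, so metadata is still allowed. */
  final class Head {
    /** Simplified: takes both names instead of a nested author context. */
    Head author(String firstname, String surname) {
      out.append("author=").append(firstname).append(' ').append(surname).append('\n');
      return this;
    }

    /** Adding the first paragraph moves the article to the Body state. */
    Body para(String text) {
      return new Body().para(text);
    }
  }

  /** State: content has been added. There is deliberately no author() here. */
  final class Body {
    Body para(String text) {
      out.append("para=").append(text).append('\n');
      return this;
    }

    String render() {
      return out.toString();
    }
  }

  static String demo() {
    // ArticleSketch.start().para("x").author("a", "b") would not compile,
    // because Body has no author() method.
    return ArticleSketch.start()
        .author("Wilfred", "Springer")
        .para("This is the first paragraph.")
        .render();
  }

  public static void main(String[] args) {
    System.out.print(demo());
  }
}</pre>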
<h3 id="d4e117" class="title">5. Using Pecia</h3>
<p>In the previous section, you have seen most of the basic principles behind Pecia. However, you have not yet seen how an output document actually gets generated. That was deliberate. The important thing here is the API outlined above. How you get your hands on an actual implementation, and how that implementation handles the documents you are building, is implementation specific.</p>
<p>Fortunately, Pecia <span class="emphasis"><em>does</em></span> come with an implementation. This is how you use it:<br />
<a name="example-implementation"></a></p>
<p class="title"><strong>Example 9. Producing HTML</strong></p>
<pre class="programlisting">// The standard implementation uses a wrapper around StAX to
// produce XML documents.
XmlWriter writer = new StreamingXmlWriter(...);

// The DocumentBuilder will actually produce the output.
DocumentBuilder builder = new HtmlDocumentBuilder(writer);

// But if we are building documents, we need to have an
// implementation of the interfaces mentioned above. Let's wrap
// the DocumentBuilder in an Article implementation. (The
// second argument is the Article's title.)
ArticleDocument document =
  new DefaultArticleDocument(builder, "Example");

// ... and now we can build the document.
document
  .section("First section")
    .section("First subsection")
    .end()
  .end()
.end();</pre>
<p><br class="example-break" />The standard Pecia implementation generates the output on the fly. Technically, there is nothing preventing you from creating the entire document in memory first, and generating output afterwards. All of this is just an implementation detail. In fact, I expect significant changes to the implementation at some point in the future; use this implementation at your own risk.</p>
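<p>To illustrate why the choice between streaming and in-memory output is invisible to the code that builds the document, here is a small sketch. Everything in it is an assumption made for this illustration: the Output interface, StreamingOutput and BufferedOutput are hypothetical names and do not correspond to Pecia's DocumentBuilder or XmlWriter.</p>
<pre class="programlisting">// Hypothetical sketch; Output, StreamingOutput and BufferedOutput are not Pecia types.
final class OutputSketch {
  /** The only thing the document-building code gets to see. */
  interface Output {
    void startSection(String title);
    void endSection();
  }

  /** Writes each event to the target as soon as it happens. */
  static final class StreamingOutput implements Output {
    private final StringBuilder target;

    StreamingOutput(StringBuilder target) {
      this.target = target;
    }

    public void startSection(String title) {
      target.append("(").append(title);
    }

    public void endSection() {
      target.append(")");
    }
  }

  /** Keeps everything in its own buffer until flushTo() is called. */
  static final class BufferedOutput implements Output {
    private final StringBuilder buffer = new StringBuilder();
    private final Output inner = new StreamingOutput(buffer);

    public void startSection(String title) {
      inner.startSection(title);
    }

    public void endSection() {
      inner.endSection();
    }

    void flushTo(StringBuilder target) {
      target.append(buffer);
      buffer.setLength(0);
    }
  }

  /** Document-building code, unaware of which implementation it talks to. */
  static void build(Output out) {
    out.startSection("First section");
    out.startSection("First subsection");
    out.endSection();
    out.endSection();
  }

  static String streamed() {
    StringBuilder target = new StringBuilder();
    build(new StreamingOutput(target));
    return target.toString();
  }

  static String buffered() {
    StringBuilder target = new StringBuilder();
    BufferedOutput out = new BufferedOutput();
    build(out);
    String beforeFlush = target.toString(); // nothing has reached the target yet
    out.flushTo(target);
    return "before=" + beforeFlush + ", after=" + target;
  }

  public static void main(String[] args) {
    System.out.println(streamed());
    System.out.println(buffered());
  }
}</pre>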
<h3 id="d4e126" class="title">6. Pecia State</h3>
<p>After having read the previous section, you probably already guessed that Pecia is not done yet. It <span class="emphasis"><em>is</em></span> usable, and it <span class="emphasis"><em>is</em></span> actually in use in one of my projects, but there is still work left to be done. So this article, in a way, covers an alpha version of the API.</p>
<h3 id="d4e131" class="title">7. Pecia Document Object Model</h3>
<p>The document object model supported by Pecia today is fairly simple. In fact, it is probably <span class="emphasis"><em>way</em></span> too simple, which is another reason why there is no 1.0 version of Pecia yet.<br />
<a name="fig-dom"></a></p>
<p class="title"><strong>Figure 1. Pecia Document Object Model</strong></p>
<p><img class="alignnone size-full wp-image-2209" title="Pecia DOM" src="http://blog.xebia.com/wp-content/uploads/2009/06/dom.png" alt="Pecia DOM" width="549" height="558" /></p>
<p><br class="figure-break" /><a class="xref" title="Figure 1. Pecia Document Object Model" href="#fig-dom">Figure 1, “Pecia Document Object Model”</a> provides a schematic overview of the document object model supported. The arrows denote potential containment relationships: a list item can contain paragraphs, sections can contain tables, lists, verbatim content and other sections, etc. The document elements supported are pretty self-explanatory. The only exception may be xref, which represents an internal reference to another part of the document.</p>
<h3 id="d4e142" class="title">8. Summary &amp; Conclusions</h3>
<p>In this article, I have tried to justify the creation of yet another framework for generating documentation. It grew out of unease with the existing solutions, and then turned out to have a couple of interesting side effects. Pecia not only prevents you from breaking the structure of the documents you generate <em>at compile time</em>, but also enables the IDE to keep you from making these errors in the first place, at coding time.</p>
<p>In fact, there are some benefits that I haven't even covered yet. They might be less tangible, but they are nevertheless very real. I started to see them when applying Pecia in a project where I had 40-something small objects floating around, each of which needed to be represented differently in my document, depending on the context.</p>
<p>In situations like those, many of the existing frameworks commonly used for generating documents will force you to have all of these 40-something objects expose their state to the outside world, in order to be able to bind to it from an external template.</p>
<p>However, this violates one of the most important principles of object orientation: the encapsulation principle. By having the objects expose their entire state to the outer world, you have actually increased the dependencies between the outer world and the object's implementation, and instead of defining behaviour as part of the object, the behaviour (how to represent itself) is externalized <span class="emphasis"><em>entirely</em></span>. Consequently, maintaining the templates becomes a nightmare; your templates need to have a deep understanding of the internals of all objects.</p>
<p>Doxia would already allow you to take a different approach: you could define a common interface on all objects, allowing each object to render itself using the Sink interface. However, how would you convey the context in which the content needs to be written? How would you make sure that your object understands that it needs to display itself as part of a paragraph, or as a table? How do you make sure that the object does not write outside of the context expected by the calling program?</p>
<pre class="programlisting">void render(Sink sink) {
  // doh, how would I know if I am in a paragraph or table context
}</pre>
<p>In Doxia, there is no way of solving this. In Pecia, this is trivial. The common interface would simply define a single operation for each context in which the object needs to render itself:</p>
<pre class="programlisting">void render(Para&lt;?&gt; para) {
  // ah, I need to render it as part of a paragraph
  // and there is no way of writing outside of that context.
}</pre>
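<p>To make this concrete, here is a small sketch of such a common interface. It assumes two hypothetical context types, ParaContext and CellContext, standing in for Pecia's paragraph and table cell contexts, plus a hypothetical Renderable interface and Price class; none of these names come from Pecia. Note that the object's state stays private: it only decides how to present itself in each context.</p>
<pre class="programlisting">// Hypothetical sketch; all names below are illustrative and not part of Pecia.
final class RenderSketch {
  /** Stand-in for a paragraph context: only inline text can be written. */
  static final class ParaContext {
    private final StringBuilder out = new StringBuilder();

    ParaContext text(String s) {
      out.append(s);
      return this;
    }

    String render() {
      return out.toString();
    }
  }

  /** Stand-in for a table cell context: only whole paragraphs can be added. */
  static final class CellContext {
    private final StringBuilder out = new StringBuilder();

    CellContext para(String s) {
      out.append("[").append(s).append("]");
      return this;
    }

    String render() {
      return out.toString();
    }
  }

  /** One operation per context in which an object may need to appear. */
  interface Renderable {
    void render(ParaContext para);
    void render(CellContext cell);
  }

  /** An object that renders itself without exposing its state. */
  static final class Price implements Renderable {
    private final int cents;

    Price(int cents) {
      this.cents = cents;
    }

    public void render(ParaContext para) {
      para.text("costs " + format());
    }

    public void render(CellContext cell) {
      cell.para(format());
    }

    /** Formats non-negative amounts with two decimals. */
    private String format() {
      return (cents / 100) + "." + (cents % 100 / 10) + (cents % 10);
    }
  }

  static String inParagraph() {
    ParaContext para = new ParaContext();
    new Price(1250).render(para);
    return para.render();
  }

  static String inCell() {
    CellContext cell = new CellContext();
    new Price(1250).render(cell);
    return cell.render();
  }

  public static void main(String[] args) {
    System.out.println(inParagraph());
    System.out.println(inCell());
  }
}</pre>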
<p>Which is only to say: there does seem to be a case for frameworks like Pecia. True, the current incarnation of Pecia is still quite young, and there are definitely things that will change in due time, but I have come to believe that it has potential, and I hope to have convinced you of that too.</p>
<p>Pecia is currently hosted at SourceForge (http://pecia.sourceforge.net/).</p>
<p><sup>[<a name="ftn.d4e40"></a>] </sup>See <a class="ulink" href="http://www.martinfowler.com/bliki/FluentInterface.html" target="_top">http://www.martinfowler.com/bliki/FluentInterface.html</a>.</p>
]]></description>
		<wfw:commentRss>http://blog.xebia.com/2009/06/25/fluent-interface-for-documentation/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
