Getting the Java out of your Scala, part 2

I’m still trying to get rid of old habits, to shake of my winter hide, so to speak, and create some real Scala in stead of ScaVa (i.e. Java with a Scala syntax). If you’re interested you can bear witness to my struggle on GitHub (ShoppingList on GitHub). This story came about because I asked some colleagues for help. We ended up rewriting loops in several ways.
What I’ll show you is some alternatives to classic loops over collections.

The code is attached here. The example shows how to summarize items in a list. The objects in the list are all instances of class Stuff, a simple value container for a String and an Int. The idea is to summarize Stuffs with the same key to produce a new list:

  val items = List(Stuff("A", 1), Stuff("A", 2), Stuff("B", 3), Stuff("B", 4))

should become

  val expectedResult = List(Stuff("A", 3), Stuff("B", 7))

where Stuff is a simple case class with two attributes:

case class Stuff(val label: String, val number: Int)

Because I’m a long time Java programmer, my first solution looks like this:

  def classicSum = {
    var result: List[Stuff] = List()
    for (item <- items) {
      if (result.size > 0 && item.label == result.head.label) {
        result = Stuff(result.head.label, result.head.number + item.number) :: result.drop(1)
      } else {
        result = item :: result
    assertEquals(expectedResult, result.reverse)

Good old looping and a var to collect the result. This works, but it doesn’t say what’s happening. All you see is a loop and some fancy list processing. Note the call to drop(1) allowing you to replace the head of the list with a new instance. I like the drop function and I’ve used it in other programs but here it just obfuscates matters.

The next solution is to go recursive. If you check out the version of ShoppingList as of October/November 2011, you’ll find lots of recursion. My goal at the time was to jam recursion into my head by banning all other forms of looping.
Applied to the list of Stuff instances and the problem at hand, the result is tragic:

  def recursiveSum = { 
    @tailrec def sum(listOfPairs: List[Stuff], result: List[Stuff]): List[Stuff] = {
      listOfPairs match {
        case Nil => result
        case head :: tail =>
            val currentHead = if (result.size == 0) { Stuff(head.label, 0) } else { result.head }
            val newResult =
              if (currentHead.label == head.label) {
                Stuff(currentHead.label, currentHead.number + head.number) :: result.drop(1)
              } else { head :: result }
            sum(tail, newResult)
    val result = sum(items, List())
    assertEquals(expectedResult, result.reverse)

My next version leverages Scala’s collections to drastically reduce the amount of code. The first attempt passes the test but it is cryptic. In the spirit of the red-green-refactor mantra I’ll show it to you anyway:

  def sumByGroupSolution1 = {
    val groupedByLabel = items.groupBy(_.label)
    val result = groupedByLabel map { t => Stuff(t._1, t._2 map { _.number } sum) }
    assertEquals(expectedResult, result)

There! Go and parse that if you need to change this code sometime next year (note: having worked with this code for some time now while writing this blog, I must admit that it grows on you; after a while it doesn’t hurt all that much, sort of like a new pair of shoes). This cryptic piece of mal-ware illustrates the use of three powerfull methods named groupBy, map and sum.
groupBy is in a sense comparable to SQLs group by clause. The Scala version takes a collection and returns a Map. The map’s key is the element used to group by (Stuff.label in my case). The value is a List of elements that have the same key. In this case the items collection contains Stuff instances. I’ve grouped by the label field so the type of groupedByLabel is Map[String, List[Stuff]]. On this Map we have to apply a sum function to add up all Stuff instances in the List of Stuffs. Sum however works on Int’s, so before sum can be applied we have to extract the number field of each Stuff instance. This is done by a map: map {_.number}. The mapping is applied to each Stuff instance in the list of Stuffs that was returned by groupedByLabel.

By factoring out part of the code on line 3 in the example above, we can improve readability:

  def sumByGroupSolution2 = {
	val stuffsGroupedByLabel = items.groupBy(_.label)
    def sumOfStuffsWithTheSameLabel(stuffs: List[Stuff]): Int = stuffs map { _.number } sum
    val result = stuffsGroupedByLabel map { t => Stuff(t._1, sumOfStuffsWithTheSameLabel(t._2)) }
    assertEquals(expectedResult, result)

I think this is better because now the result of part of the computation is named. Naming a thing makes it easier to see the algorithm.

The next solution uses more intermediate results. The results of each map or sum operation are stored in a variable that gets a meaningful name. This clarifies the meaning of intermediate results but the amount of code grows.

  def sumByGroupSolution5 = {
    val calcSumOfStuff = (stuffs: Seq[Stuff]) => stuffs map { _.number } sum
    val groupedByLabel = items.groupBy { _.label }
    val resultGroupedByLabel = groupedByLabel mapValues { calcSumOfStuff }
    val result = resultGroupedByLabel map { t => Stuff(t._1, t._2) }
    assertEquals(expectedResult, result)

The code shows another Map function named mapValues.

 val resultGroupedByLabel = groupedByLabel mapValues { calcSumOfStuff }

mapValues works on the values of a map only, ignoring the keys. In my case the keys don’t really matter (they’re an artifact of the groupBy call) so I can get away with mapValues rather than map.

Naming things was also the inspiration for the next solution. Now I’m not extracting intermediate results but in this case I’ve named the variables being manipulated:

  def sumByGroupSolution4 = {
    val groupedByLabel = items.groupBy(_.label)
    val result = groupedByLabel map { case (label, stuffs) => Stuff(label, stuffs map { _.number } sum) }
    assertEquals(expectedResult, result)

The case statement as argument to the map makes it possible to introduce two variables to identify the things we’re manipulating. Label and stuffs mean more to me than _1 and _2.

Looking back on my struggle I like this solution best. I think it is both concise and easy to read.