Scala Lists: Watch your append and prepends

Today I came across a nice problem in a piece of code one of our colleagues sent around asking for improvements. The problem consists of grouping a large List of lines according to some characteristics. Once the List grew large, the function which took care of the grouping took over 2 minutes to group the dataset.

The problem

Let's first give a fictitious example of the domain and function that was wreaking havoc:

case class Container(lines: List[Contained])
case class Contained(key: String, value: String)

def group(ts: List[_]) = ts.foldLeft(List[Container]()) {
  case (l: List[Container], r: Contained) =>
    r match {
      case Contained("", "") => l
      case x @ Contained("", _) => {
        l.init ++ List(Container(l.last.lines ::: x :: Nil))
      }
      case x : Contained  => l ++ List(Container(List[Contained](x)))
      case _ => l
    }
  }

If the function was called on a suitably large list (100,000 items), execution time jumped to 130 seconds. Though the code is functionally sound, it isn't all that optimized for Scala Lists. That leaves us with a number of possible optimizations:

  1. Replace each List with ListBuffer, making it effectively a mutable collection of mutable objects
  2. Make the Container object mutable by using a ListBuffer
  3. Optimize the function for Lists

Because the first and second scenarios don't differ much, I'll just present the first and last one here.

Fully mutable solution

case class Container(lines: ListBuffer[Contained])
case class Contained(key: String, value: String)

def group(ts: List[_]) = ts.foldLeft(ListBuffer[Container]()) {
  case (l: ListBuffer[Container], r: Contained) =>
    r match {
      case Contained("", "") => l
      case x @ Contained("", _) => {
        l.last.lines.append(x)
        l
      }
      case x : Contained  => l.append(Container(ListBuffer[Contained](x))); l
      case _ => l
    }
}

The execution speed over 100,000 items for this solution is around 350ms, quite an achievement by just using mutable objects.

Optimizing the immutable function

The first two options improve the runtime speed considerably, but both make the objects mutable. This is an evil we can avoid if we take a closer look at the group function and the List operations it's using. The two main operations the group function is using are: List.init and List.last. Let's take a look at the operations on List and their complexity:

List.head O(1)
List.tail O(1)
List.last O(n)
List.init O(n)
List.reverse O(n)

We can easily rewrite the group function to use head and tail instead of last and init. Each is the other's counterpart on a reversed list, ie.

list.init == list.reverse.tail.reverse
and
list.last == list.reverse.head

If we just immediately build the reverse lists, we just need to reverse once at the end. And we've saved a lot of O(n) operations. Let's have a look at the optimized version, without the final reversing:

def improvedGroup(ts: List[_]) = ts.foldLeft(List[Container]()) {
  case (l: List[Container], r: Contained) =>
    r match {
      case Contained("", "") => l
      case x @ Contained("", _) => {
        List(Container(List(x) ++ l.head.lines)) ++ l.tail
      }
      case x : Contained  => List(Container(List[Contained](x))) ++ l
      case _ => l
    }
  }

The only remaining problem now is that the List[Container] and the lines of the Container are both reversed, so in order to get the same result, the function needs to be adapted a bit more to:

def fastGroup(ts: List[_]) = improvedGroup(ts).reverse.map(x => Container(x.lines.reverse))

Now the fastGroup function delivers the same results, and its runtime speed over 100,000 items is around 450ms.

I don't know about you, but I prefer the last solution. Though you need to remember to put the elements in the right order again after the grouping, you don't sacrifice immutability. It sometimes pays off to work towards a solution one step at a time, instead of trying to craft it all at once.

Comments (6)

  1. Wilfred - Reply

    April 4, 2012 at 12:22 pm

    What is it actually supposed to be doing? Can you explain that to me in just a few lines?

  2. Tony Morris - Reply

    April 4, 2012 at 12:27 pm

    As a general rule, if you are doing any of the following:
    * appending to a list
    * reversing a list
    * [init]ing a list
    * [last]ing a lit
    ...you are not using an appropriate data structure.

    As for a data structure with O(1) cons and snoc, see DList (difference list) in Scala. It is isomorphic to EndomorphismT[Trampoline, List[A]]. In short, it is equivalent to List[A] => List[A], but with trampolining to use heap instead of stack. It is important to do this when using scala.List in general, so careful of stack usage in other contexts.

  3. Jeroen van Erp - Reply

    April 4, 2012 at 12:28 pm

    Hi Wilfred,

    I can explain that. What goes in the function is a large List of Contained objects. These need to be grouped into different Containers. The rule is that the Contained objects are in order when coming in. Every Contained can either be:
    1. Empty (empty strings in key and value) --> Ignored
    2. A continuation (empty string in key)
    3. A new Container (non-empty string in key)
    4. Something else --> Ignored

    So long as you see Continuations, these need to be added to the last Container that was added.

    Does that make it clearer?

  4. Jean - Reply

    April 4, 2012 at 1:22 pm

    Thanks wilfred for asking the question and Jeroen for answering it, even better would have been a unit test demonstrating the inputs and outputs 😉

  5. Viktor Klang - Reply

    April 4, 2012 at 6:40 pm

    Use scala.collection.immutable.List wherever you'd use a Stack.

  6. Dhananjay Nene - Reply

    April 5, 2012 at 8:03 am

    Here's another version which does not use head/tail. It does use reverse for the sake of ordering. Not sure how it's performance is with respect to the improvedGroup + fastGroup functions


    def group(ts: List[Contained]) = {
    val (list, buffer) = ts.foldLeft((List[Container](),List[Contained]())) {
    case (pair, r: Contained) =>
    val (l,b) = pair
    r match {
    case Contained("", "") => (l,b)
    case x @ Contained("", _) => (l, x :: b)
    case x : Contained => if (b isEmpty) (l, List[Contained](x)) else (Container(b reverse) :: l, List[Contained](x))
    case _ => (l,b)
    }
    }
    ((Container(buffer reverse) :: list) reverse)
    }

Add a Comment