Beware of transitive dependencies... For they can be old and leaky

In many JEE apps today, you almost cannot forgo XML. Whether it is configuration, data structures, or service interfaces, you will certainly use a number of XML files. In a recent project we had to deal with a number of external services that exposed an XML interface. Little did we know that we had introduced a potential time bomb into our application...

Each of the XML services we accessed provided data for the same domain. So instead of writing or generating a great number of Java classes, we decided to use XSLT to transform the XML into a common format we could easily unmarshal. During performance and stress tests, this solution performed very well and showed no signs of degradation.
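A minimal sketch of that normalization step, using the JAXP transformation API. The service payload, the stylesheet, and all element names here are made up for illustration; the real stylesheets mapped each external service's format onto our common one.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XmlNormalizer {

    /** Applies the given XSLT stylesheet to the input XML and returns the result. */
    public static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical payload from one external service; element names differ per service.
        String serviceXml = "<svcCustomer><fullName>Jane Doe</fullName></svcCustomer>";

        // Stylesheet mapping this service's format onto our common format.
        String stylesheet =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "  <xsl:template match='/svcCustomer'>"
            + "    <customer><name><xsl:value-of select='fullName'/></name></customer>"
            + "  </xsl:template>"
            + "</xsl:stylesheet>";

        System.out.println(transform(serviceXml, stylesheet));
    }
}
```

Note that `TransformerFactory.newInstance()` picks up whatever TrAX implementation is on the classpath, which is exactly how a transitive Xalan dependency can quietly end up doing your transformations.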

However... once we hit production, the application ran out of memory after a few days on a heap of 1 GB. This was bad: how come our performance and stress tests hadn't shown this? We decided to do the only sensible thing and create a heap dump as soon as the application showed signs of stress. We started monitoring the Java process using the jstat utility found in the JDK.

$ jstat -gcutil <processid> 1s 10000
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
  0.00   0.00  99.98 100.00  26.66   2896  119.886  6000 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6001 32234.945 32354.831
  0.00   0.00 100.00 100.00  26.66   2896  119.886  6002 32240.277 32360.163
 

As soon as we saw the old-generation utilization hit 100% and the full garbage collections fail to reclaim anything, we created a heap dump:

$ jmap -dump:format=b,file=/tmp/heap.hprof <processid>

Using the Eclipse Memory Analyzer we looked into the heap dump to see what was causing our memory to grow.

Clicking on the "Leak Suspects" link in the analyzer quickly showed the likely culprit: the XMLReaderManager class from Xalan had somehow claimed a big chunk of the available memory.

Searching through our Maven 2 POM files showed no explicit dependency on Xalan, but the dependency graph in the generated Maven site showed it as a transitive dependency: one of our dependencies declared xalan-2.6.0. Googling around quickly turned up two bug reports against Xalan describing this exact problem: XALANJ-2178 and XALANJ-2195.
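For anyone hitting the same thing: `mvn dependency:tree` will show where a transitive dependency comes from, and one way to get rid of the leaky version is to declare the newer Xalan explicitly in your own POM. Maven's "nearest definition wins" mediation then selects 2.7.0 over the transitive 2.6.0. A minimal fragment, assuming nothing else in your build pins the version:

```xml
<dependency>
    <groupId>xalan</groupId>
    <artifactId>xalan</artifactId>
    <version>2.7.0</version>
</dependency>
```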

Looking at the source code of Xalan 2.6.0 we could indeed clearly see the memory leak; in 2.7.0 the class was updated to no longer keep references around for too long. But why didn't our stress tests, on a similar machine and configuration, show this?

The answer can also be found in XMLReaderManager. It caches the XMLReader in a ThreadLocal, so that it can hand the same XMLReader back to a thread that requests a new one. However, in a JEE system you normally have thread pools that scale up and down depending on the system load at a given time. In our stress tests we only simulated full load for a prolonged time, and under that load the XMLReaders weren't leaked, as threads were being reused from the thread pool. If we had simulated a more varied load over a longer period of time, we would have seen this occurring before we hit production.
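The caching pattern at the heart of this looks roughly like the following sketch. This is a simplified illustration of per-thread caching, not Xalan's actual code; the class and field names are invented.

```java
import java.nio.charset.StandardCharsets;

public class ThreadLocalCacheDemo {

    /** Stand-in for an XMLReader: a "parser" that keeps a reference to its last input. */
    static class CachedParser {
        byte[] lastInput; // pins the last document for as long as the parser is alive

        String parse(byte[] input) {
            lastInput = input; // simplified: the real reader retained internal references
            return new String(input, StandardCharsets.UTF_8);
        }
    }

    // The pattern used in spirit by XMLReaderManager: one parser per thread,
    // cached indefinitely in a ThreadLocal.
    static final ThreadLocal<CachedParser> CACHE =
            ThreadLocal.withInitial(CachedParser::new);

    static String handleRequest(String doc) {
        // Each pooled thread reuses its own parser; the parser, and whatever it
        // still references, lives as long as the thread itself does.
        return CACHE.get().parse(doc.getBytes(StandardCharsets.UTF_8));
    }
}
```

With a fixed-size pool under constant load, the number of cached parsers is bounded by the pool size, which is exactly why a steady full-load stress test can look perfectly healthy while a production workload with thread churn slowly fills the heap.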

Comments (3)

  1. Markus Kohler

    September 16, 2008 at 4:08 pm

    Hi,
    you don't really need to trigger a heap dump manually in case of an out-of-memory error.
    You can just use the option
    -XX:+HeapDumpOnOutOfMemoryError and a heap dump will be created automatically.

    Regards,
    Markus

  2. Jeroen van Erp

    September 17, 2008 at 9:03 am

    Hi Markus,

    This is true indeed, except that we didn't want to let the machine get that far, as it was a production machine. Once it hit the top, we needed to restart it ASAP and not let it get too slow. That's why we chose to do a manual dump 🙂

    Regards,
    Jeroen

  3. Pieter vd Meer

    April 28, 2009 at 10:12 am

    Hmm,

    Although I agree that transitive dependencies can be hell sometimes, I don't think Maven or the library at stake is to blame here.

    The problem is in the testing: a load/stress test should resemble the actual load on the system.
    Running a test that does the same thing 1M*1M*1M times is not a test, just something to see if the result is always the same; and trust me, as long as you are not doing creative FP calculations, it will be the same.

    The advice is: run your load and stress tests to simulate the load the system will actually receive, and vary the request data!

    Pieter
