Beware of transitive dependencies... For they can be old and leaky
In many JEE apps today, you can hardly avoid XML. Whether it is configuration, data structures or service interfaces, you will certainly deal with a number of XML files. In a recent project we had to integrate with a number of external services that exposed an XML interface. Little did we know that we had introduced a potential time bomb into our application...
Each of the XML services our application accessed provided data for the same domain. So instead of writing or generating a great number of Java classes, we decided to use XSLT to transform the XML into a common format we could easily unmarshal. During performance and stress tests this solution performed very well and showed no signs of degradation.
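The approach can be sketched roughly like this with the standard JAXP transformation API (the class and parameter names here are hypothetical, one stylesheet per external service mapping its XML onto our common format):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative sketch: normalize one service's XML response into the
// common format using that service's stylesheet.
class XmlNormalizer {
    static String normalize(String stylesheet, String serviceXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(serviceXml)),
                    new StreamResult(out));
        // The result is XML in the common format, ready to be unmarshalled.
        return out.toString();
    }
}
```

Note that TransformerFactory picks up whatever XSLT processor is on the classpath, which is exactly how a transitive Xalan ends up doing the work without ever being mentioned in our own code.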
However... once we hit production, the application threw an OutOfMemoryError after a few days, despite a 1 GB heap. This was bad: why hadn't our performance and stress tests shown this? We decided to do the only sensible thing and create a heap dump as soon as the application showed signs of stress. We started monitoring the Java process using the jstat utility that ships with the JDK.
$ jstat -gcutil <processid> 1s 10000
  S0     S1     E      O      P     YGC    YGCT     FGC      FGCT        GCT
  0.00   0.00  99.98 100.00  26.66  2896  119.886  6000  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6001  32234.945  32354.831
  0.00   0.00 100.00 100.00  26.66  2896  119.886  6002  32240.277  32360.163
As soon as we saw the old space utilization hitting 100% and the full garbage collector unable to free anything, we created a heap dump:
$ jmap -dump:format=b,file=/tmp/heap.hprof <processid>
Using the Eclipse Memory Analyzer we looked into the heap dump to see what was causing our memory to grow. This produced the following picture.
Clicking the "Leak Hunter" link in the analyzer quickly revealed the likely culprit: the XMLReaderManager class from Xalan had somehow taken up a big chunk of the available memory.
Searching through our Maven2 POM files showed no explicit dependency on Xalan, but the dependency graph in the generated Maven site listed it as a transitive dependency: one of our dependencies declared xalan-2.6.0. A quick search turned up two bug reports against Xalan describing this exact problem: XALANJ-2178 and XALANJ-2195.
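Once the offending transitive dependency is identified (for example with `mvn dependency:tree`), it can be excluded from the dependency that drags it in, and a fixed version declared explicitly. A sketch of the POM change, with a made-up groupId/artifactId for the direct dependency:

```xml
<dependency>
  <!-- hypothetical direct dependency that pulls in xalan 2.6.0 -->
  <groupId>com.example</groupId>
  <artifactId>service-client</artifactId>
  <version>1.0</version>
  <exclusions>
    <exclusion>
      <groupId>xalan</groupId>
      <artifactId>xalan</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- declare the fixed version explicitly instead -->
<dependency>
  <groupId>xalan</groupId>
  <artifactId>xalan</artifactId>
  <version>2.7.0</version>
</dependency>
```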
Looking at the source code of Xalan 2.6.0 we could indeed clearly see the memory leak; in 2.7.0 the class was changed to prevent it from keeping references around for too long. But why didn't our stress tests, on a similar machine and configuration, show this?
The answer can also be found in XMLReaderManager. It caches the XMLReader in a ThreadLocal, so that it can hand the same XMLReader back to a thread that requests a new one. However, in a JEE system you normally have thread pools that scale up and down with the system load. In our stress tests we only simulated full load for a prolonged time, and under that load the XMLReaders weren't leaked, because threads were being reused from the thread pool. Had we simulated a more varied load over a longer period, we would have seen this occur before we hit production.
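The general failure mode can be sketched as follows. This is a deliberately simplified illustration, not Xalan's actual code (the class and field names are made up): a cache keyed on the calling thread whose entries are never evicted when that thread dies. With a pool that shrinks and grows, entries for dead threads accumulate until the heap is exhausted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a per-thread cache gone wrong. The byte[] stands
// in for a cached XMLReader; each entry costs ~1 MB here. Nothing ever
// removes the entry for a thread that has exited.
class PerThreadCache {
    static final Map<Thread, byte[]> CACHE = new ConcurrentHashMap<>();

    static byte[] get() {
        // Reuse this thread's entry, or create one on first use.
        return CACHE.computeIfAbsent(Thread.currentThread(),
                t -> new byte[1024 * 1024]);
    }
}
```

With a stable thread pool, every call hits an existing entry and the cache stays flat; but spawn a short-lived thread per request and the cache grows by one entry per thread, even after those threads have exited, which is exactly the varying-load scenario our stress tests never exercised.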