By Jeroen Borgers

Recently, I was called in by a company with a website in trouble. And because they make all their money online, it was evident that they really wanted the issue solved. The day I came in, the site had gone down about five times in the previous 24 hours. Because of this they got less traffic, which directly meant less revenue. The log files showed periods of long response times and OutOfMemoryErrors. Their questions were: why do we get this behavior, and how do we fix it? My short answer turned out to be: because of too many loitering objects; and this can be fixed by not holding on to them in the HTTP session. I'll explain.

Troubleshooting assignment
As is often the case in this kind of troubleshooting assignment, once you have found exactly what is wrong, you've almost solved it already. The difficulty is in the detective work of nailing down the root cause. Or, put differently, finding the needle in the haystack.

The tool they were already using, and which was actually new to me, was VisualVM.
Visual VM About window
Figure 1. VisualVM About window

It is similar to the JDK's JConsole in that you can visually look into the JVM. It has a plug-in architecture by which, for instance, parts of JConsole are plugged into VisualVM to show heap and thread usage, loaded classes and MBeans. To keep overhead lower than JConsole's, however, it shows less detail in sections like the heap. It also has a profiler plug-in, which is the NetBeans profiler, so you can CPU- or memory-profile local applications from it. And since version 1.6.0_07, the JDK has this great tool bundled with it. In case you cannot find it in the JDK: it has been renamed to jvisualvm.

VisualVM monitor example showing analysis of a JMeter process
Figure 2. VisualVM monitor example

VisualVM helped in that we could see in the generated stack dumps that threads were often waiting for some log4j lock. The reason was that the log level was set to DEBUG; setting it to INFO improved performance somewhat. However, in this case VisualVM did not bring me the crucial facts, for two reasons. The first was the lack of detail I mentioned; the second was the type of tool: one that has to witness the problem situation in order to analyze it.
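A minimal log4j.properties sketch of that change; the appender setup and logger names here are made-up stand-ins, not the customer's actual configuration:

```properties
# Raise the root logger from DEBUG to INFO to cut down
# logging volume and contention on the appender lock.
log4j.rootLogger=INFO, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d %-5p [%t] %c - %m%n
```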

Finding the crucial information
The toolset that gave me the first half of the crucial information was the -Xloggc:gc.log flag combined with HPJmeter. This gave us the opportunity to analyze after the fact. That night, traffic was rather low and the site did not go down, but it was very slow for some hours. The next day, the log file showed the following from the time performance turned bad:
15045.703: [GC 1786342K->1481712K(1856512K), 0.1639990 secs]
15053.886: [GC 1790320K->1489985K(1858880K), 0.1722250 secs]
15065.930: [GC 1800257K->1493876K(1859968K), 0.1601460 secs]
15066.090: [Full GC 1493876K->1467212K(1859968K), 3.5031310 secs]
15071.966: [Full GC 1708415K->1471029K(1859968K), 3.5039160 secs]
15078.215: [Full GC 1708277K->1476287K(1859968K), 3.5326720 secs]
15087.847: [Full GC 1708386K->1408698K(1859968K), 4.5038550 secs]

We see here the fall-over from doing mainly minor GCs (young generation) to mainly Full GCs, so from short pauses to long pauses: a Full GC every 6 seconds or so, each taking about 3.5 s. So a lot of time is spent in GC. As you can see, however, some memory is still reclaimed each time. With HPJmeter I could see this in nice graphs and draw conclusions:
HPJmeter showing percentage of time spent in GC.
Figure 3. HPJmeter showing percentage of time spent in GC.

So, it became clear to me that the site was so slow because up to 95% of the CPU cycles were spent on garbage collection, which left only 5% to the application itself.
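As a rough cross-check on that figure, the four Full GC lines shown above already account for well over half the elapsed wall time on their own; HPJmeter's 95% comes from a longer, worse window than these few lines. A small sketch of the arithmetic:

```java
public class GcOverhead {
    public static void main(String[] args) {
        // Timestamps (s since JVM start) and pause durations (s)
        // of the four Full GC lines from the gc.log excerpt above.
        double[] start = {15066.090, 15071.966, 15078.215, 15087.847};
        double[] pause = {3.5031310, 3.5039160, 3.5326720, 4.5038550};

        // Wall time elapsed from the first Full GC to the end of the last one.
        double wall = (start[3] + pause[3]) - start[0];

        double paused = 0;
        for (double p : pause) {
            paused += p;
        }

        // Fraction of that wall time the application was stopped for Full GC.
        System.out.printf("%.0f%% of wall time spent in Full GC%n",
                100 * paused / wall); // prints: 57% of wall time spent in Full GC
    }
}
```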

Need for more details
However, the reason for this excessive GC was not yet clear to me. I needed more details on the memory usage. So, we switched on the -XX:+PrintGCDetails flag, which gave us the second half of the crucial information.
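For reference, both logging flags go on the JVM command line; a sketch (the application jar name is made up):

```
java -Xloggc:gc.log -XX:+PrintGCDetails -jar webshop.jar
```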
Unfortunately, HPJmeter could not deal with the produced log format, but at the onset of the problems we got lines like:
21270.421: [GC [PSYoungGen: 436938K->112249K(481664K)] 1808785K->1489945K(1879808K), 0.1790760 secs]
21277.559: [GC [PSYoungGen: 443385K->111206K(486528K)] 1821081K->1497781K(1884672K), 0.1815330 secs]
21284.659: [GC [PSYoungGen: 442406K->107401K(489536K)] 1828981K->1504035K(1887680K), 0.1780120 secs]
21284.837: [Full GC [PSYoungGen: 107401K->79553K(489536K)] [PSOldGen: 1396634K->1398143K(1398144K)] 1504035K->1477697K(1887680K) [PSPermGen: 79038K->79038K(262144K)], 3.5320210 secs]
21292.969: [Full GC [PSYoungGen: 331200K->83102K(489536K)] [PSOldGen: 1398143K->1398143K(1398144K)] 1729343K->1481246K(1887680K) [PSPermGen: 79047K->79047K(262144K)], 3.5302490 secs]
21302.663: [Full GC [PSYoungGen: 331200K->85665K(489536K)] [PSOldGen: 1398143K->1398143K(1398144K)] 1729343K->1483809K(1887680K) [PSPermGen: 79055K->79055K(262144K)], 3.4508870 secs]
21311.177: [Full GC [PSYoungGen: 330681K->21303K(489536K)] [PSOldGen: 1398143K->1398144K(1398144K)] 1728825K->1419447K(1887680K) [PSPermGen: 79065K->79027K(262144K)], 4.4884740 secs]

Again we see the fall-over. Look at the old generation: the first Full GC promotes objects from young to old, filling it almost completely; the Full GC lines show only 1K or less of unused space left. The confusion comes from the young generation, where most garbage can still be collected. This keeps the total heap from looking full, while the old generation is at its limit. So, something is getting stuck in the old generation. The remarkable thing is that the situation recovered after an hour or so and no OutOfMemoryError occurred.

Heap analysis with NetBeans
So, now it was time for more heap analysis. NetBeans has a great built-in profiler with a useful heap analyzer. We found that most of the memory was used by Category objects, which was never the intention. Using the heap walker we could see that references to them were mostly held from Tomcat session objects. So each HTTP session had its own copy of these Categories, and they were kept hanging around for half an hour: the default HTTP session timeout. The term for this kind of object is loitering objects.

The quick and the real fix
At this point we could devise a quick remedy: decrease the HTTP session timeout from 30 to 5 minutes. This took most of the pressure off us fixers and gave us some air to breathe. Then we created some tests and were able to reproduce the behavior in a test environment. Looking at JConsole also clearly showed the growing usage of the old generation.
JConsole showing memory usage for each memory pool
Figure 4. JConsole showing memory usage for each memory pool.
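In a standard servlet container, that quick remedy is a one-line change in the application's web.xml (shown here with the reduced 5-minute value):

```xml
<session-config>
    <!-- session timeout in minutes; lowered from the default 30 -->
    <session-timeout>5</session-timeout>
</session-config>
```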

In the source code we could easily find the place where all Category objects were loaded from the database and put in a cache once, and where, along the way, a copy was also put in each user's HTTP session. For no real reason. And with 1700 user sessions this took up most of the memory.
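A simplified sketch of the bug and the fix. The class and method names, the category data, and the Map standing in for the HttpSession are all made-up stand-ins, not the customer's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryCacheSketch {

    // Loaded from the database once; one shared, immutable copy for everyone.
    private static final List<String> CACHE =
            Collections.unmodifiableList(Arrays.asList("books", "music", "games"));

    // Buggy pattern: each session received its own copy of the categories,
    // which then loitered in the heap for the full session timeout.
    static void buggyStoreInSession(Map<String, Object> session) {
        session.put("categories", new ArrayList<>(CACHE));
    }

    // Fix: hand out the shared cache; store nothing per session.
    static List<String> categories() {
        return CACHE;
    }

    public static void main(String[] args) {
        Map<String, Object> session1 = new HashMap<>();
        Map<String, Object> session2 = new HashMap<>();
        buggyStoreInSession(session1);
        buggyStoreInSession(session2);

        // Two distinct copies: with 1700 sessions, 1700 copies on the heap.
        System.out.println(session1.get("categories") != session2.get("categories")); // true

        // After the fix there is exactly one list, however many sessions exist.
        System.out.println(categories() == categories()); // true
    }
}
```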

So, this was a no-brainer to solve in the code. After deploying the fix and setting the session timeout back to the default 30 minutes, the problems were gone. I love it when a plan comes together.

Learn more
If you want to learn more about this stuff and get hands-on experience, come to my class "Speeding up Java applications" in April or November 2009.