Last time I blogged about the importance of continuous performance testing. When you write and run performance tests continuously, just like unit tests, you get early insight into the performance of new and changed features of your software. This minimizes surprises and makes you more productive. This time I’ll blog about monitoring and diagnostics.

When a new version of the software is released into the production environment, the question always is: will it actually perform as it did in the test and acceptance environments? And we keep our fingers crossed.
It is therefore important to monitor carefully what happens to the performance and availability of the application after such a release.

There are all sorts of tools and services available to monitor your web site for availability and response times of web pages, like Uptrends, Site24x7 and Dotcom-monitor. They look at the application as a black box and measure once every few minutes. This is very useful. However, to take the right measures in case of a calamity, you need to be able to pinpoint the problem. It is therefore essential to monitor on multiple levels and on multiple internal parts of the application. For levels, think of hardware, OS, app server, web server, database and application. Measuring internal parts of a Java application can be achieved with JAMon. JAMon is an open source timing API that basically works like a stopwatch with a start() and a stop() call. Every method you want to measure gets its own stopwatch (or counter). We cover JAMon as one of the tools for measuring time on the first day of our Speeding up Java Applications course.

Figure: JAMon API start() and stop() calls in a Spring interceptor
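
Since the figure itself is not reproduced here, the following is a minimal sketch of the same idea using only the JDK: a dynamic proxy stands in for the Spring interceptor, and System.nanoTime() stands in for JAMon's start() and stop() calls. The names OrderService, TimingHandler and proxyFor are hypothetical and are not part of JAMon or Spring.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical service interface, only here to have something to time. */
interface OrderService {
    String processOrder(String orderId);
}

/** Times every call made through the proxy, with one label per method. */
final class TimingHandler implements InvocationHandler {
    private final Object target;
    // Per label: number of calls and total elapsed nanoseconds.
    final Map<String, LongAdder> calls = new ConcurrentHashMap<>();
    final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    TimingHandler(Object target) {
        this.target = target;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        String label = method.getDeclaringClass().getSimpleName() + "." + method.getName();
        long start = System.nanoTime();               // the stopwatch "start()"
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause();                       // rethrow the target's own exception
        } finally {
            long elapsed = System.nanoTime() - start; // the stopwatch "stop()"
            calls.computeIfAbsent(label, k -> new LongAdder()).increment();
            totalNanos.computeIfAbsent(label, k -> new LongAdder()).add(elapsed);
        }
    }

    /** Wraps the handler in a proxy that implements the given interface. */
    static <T> T proxyFor(Class<T> iface, TimingHandler handler) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
        return iface.cast(proxy);
    }
}
```

Putting the stop in a finally block means a call that throws is still timed and counted, which matters because failing calls are often the slow ones.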

Each counter maintains statistics such as the number of calls, average, maximum and standard deviation, and this information can be queried. The individual calls are not stored. This approach results in low memory usage and a low performance overhead, at the cost of some information loss. Recently, a new competitor of JAMon appeared: Simon. It claims to be JAMon’s successor, although it has (or had) some teething problems.
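
To make the trade-off concrete, here is a minimal sketch of a counter that keeps only aggregates and never stores individual calls. It is not JAMon's implementation. The class name TimingStats is hypothetical, and the use of Welford's running-variance method and of the sample standard deviation are assumptions made for this illustration.

```java
/** Aggregated timing statistics for one label; individual samples are not kept. */
final class TimingStats {
    private long hits;    // number of calls recorded
    private double mean;  // running average in milliseconds
    private double m2;    // running sum of squared deviations from the mean
    private double max;   // largest value seen so far

    /** Records one measurement, updating the aggregates in constant time and memory. */
    synchronized void add(double millis) {
        hits++;
        double delta = millis - mean;
        mean += delta / hits;
        m2 += delta * (millis - mean);
        if (hits == 1 || millis > max) {
            max = millis;
        }
    }

    synchronized long hits() {
        return hits;
    }

    synchronized double average() {
        return mean;
    }

    synchronized double max() {
        return max;
    }

    /** Sample standard deviation; 0 when fewer than two measurements exist. */
    synchronized double stdDev() {
        return hits > 1 ? Math.sqrt(m2 / (hits - 1)) : 0.0;
    }
}
```

The memory used per counter stays constant no matter how many calls are recorded. The price is the information loss mentioned above: from these aggregates alone you cannot tell which individual call was the slow one.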

Then there is the question: where to measure? It makes most sense to measure all incoming calls, such as web requests, and all outgoing calls, for instance to the database. Beyond that, measure internal parts like Spring beans, EJBs and DAOs. Measuring these parts is relevant not only for new releases; trends and usage spikes are also worth monitoring, so that problems can be solved quickly or prevented altogether. The open source tool JARep can store JAMon data from a cluster in a database and show trends and changes graphically.

Figure: JARep shows the increasing response time trend starting October 15, on two of the four production JVMs.

Customer story

We had the following situation at one of our customers. Processing an order gradually took more and more time over a period of several weeks. This happened while no new release was introduced and no other page became slower. This behavior was a complete mystery until we looked deeper with our JARep monitoring tool. The troublemaker turned out to be a DAO executing a prepared statement in which only some of the variables were bind variables. With the help of JARep, we could look back to where the trend of increasing response time started, and therefore to when the problem started. We could also see that the problem was only present on one of the two machines. With this knowledge and his log book, the operator recalled that on the start date he had experimented with a new JDBC driver to try to solve a memory leak. At the time this did not seem to change anything in performance, which was indeed the case at first. Problems only appeared gradually during the following weeks. The new driver had been left in place and turned out to be a time bomb. When we put back the old driver, the problem disappeared. This real-life experience shows the usefulness of monitoring and trend analysis on application internals.
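
The customer's DAO is not shown here, so the following is only a sketch of what "only some of the variables were bind variables" can look like in JDBC. The class OrderDao, the orders table and its columns are invented for illustration. The sketch does not attempt to explain how the driver in the story reacted to such statements, since that is not known.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical DAO; table and column names are invented for illustration. */
final class OrderDao {
    /** All variables bound: the SQL text is the same for every call. */
    static final String FULLY_BOUND_SQL =
            "SELECT COUNT(*) FROM orders WHERE customer_id = ? AND status = ?";

    /** Only status is bound: customerId becomes part of the SQL text itself. */
    static String partlyBoundSql(long customerId) {
        return "SELECT COUNT(*) FROM orders WHERE customer_id = " + customerId
                + " AND status = ?";
    }

    static long countPartlyBound(Connection con, long customerId, String status)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(partlyBoundSql(customerId))) {
            ps.setString(1, status);
            return readCount(ps);
        }
    }

    static long countFullyBound(Connection con, long customerId, String status)
            throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(FULLY_BOUND_SQL)) {
            ps.setLong(1, customerId);
            ps.setString(2, status);
            return readCount(ps);
        }
    }

    /** Runs the query and returns the single COUNT(*) value. */
    private static long readCount(PreparedStatement ps) throws SQLException {
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

In the partly bound variant every distinct customer id produces a distinct SQL text, so anything keyed on that text, in the driver or in the database, sees a steadily growing number of different statements. In the fully bound variant there is only one.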

Next time I'll blog about evidence based tuning.