How the quest for transaction timeout’s did cost me money

Kris Geusebroek

At our project the focus is at making the application stable and controllable. So instead of building cool new features
we are spending our time making sure the application is able to run stable in the production environment.

After the first few issues the so called 'Transaction timeout' issue raised it's ugly head.
Every now and then the application threw an exception due to a transaction timeout.
This was very strange since the timeout was set to 30 seconds and the complete processing of the whole
application was done in less than 2 seconds (spread over more than 1 transaction).

The first thing to try was to setup a production like environment to try and reproduce this.
We spent a few hours to make our development environment as much production like as we could.
With a lot of stubbing on the outside boundaries of the system and a live feed of data to process, our
test was started on Friday afternoon.

We where very surprised to see 0 timeouts when we arrived at the office on Monday morning.
On to plan B. The original application consists of 4 parts so the plan was to strip our environment to only run a small part of the original application.
We also changed the transaction timeout to 1 second to raise the chances of seeing timeouts.
This way we might predict in which part of the application the search needed to be continued.

1 != 1

The title of this paragraph might sound a bit strange but hold on, an explanation will follow soon.

After our testrun there were some transaction timeouts. We also noticed that this part of the application did the normal processing in about 13 milliseconds.
Fast enough to beat the timeout you would think.
But then we discovered that the reason for these timeouts was not the same as the timeouts we experienced in our live environment.

The reason we got timeouts in this setup was because of a somewhat different meaning of the timeout setting than we expected.
We expected the timeout monitor to start when the transaction started and raise an exception after 1 second. But.....

Let's take a look at the code

in the class org.hibernate.jdbc.AbstractBatcher there are two methods involved

public void setTransactionTimeout(int seconds) {
    isTransactionTimeoutSet = true;
    transactionTimeout = System.currentTimeMillis() / 1000 + seconds;
}

and

private void setTimeout(PreparedStatement result) throws SQLException {
    if ( isTransactionTimeoutSet ) {
        int timeout = (int) ( transactionTimeout - ( System.currentTimeMillis() / 1000 ) );
        if (timeout<=0) {
            throw new TransactionException("transaction timeout expired");
        }
        else {
            result.setQueryTimeout(timeout);
        }
    }
}

If you set the transaction timeout to 1 second, Hibernate raises the timeout exception at the switch of the second exactly.
When you start your transaction at 13:45:21.990 the exact timeout is 10 milliseconds! No wonder we saw these timeouts.
So if you want the lower boundary of your transaction timeout to be 1 second you need to set the Hibernate transaction timeout setting to 2 seconds!

And yes, after setting it to 2 seconds, no more timeouts!

Silly idea!

During the testing I stumbled upon some threaddumps (made for a different reason). The strange thing I noticed was a lot of threads waiting for a lock on apache log4j related classes.
We do some extensive logging in our application so this wasn't completely strange.
But somewhere in my gray matter a connection was made between this phenomena and the transaction timeout we were investigating.

So I shared my thoughts with the team. I told them that I suspected that the logging done in the application might take such a long time that the transaction was not finished within the 30 seconds.
And then I was laughed at! How on earth could something as fast as writing a line to a logfile take this long.

This silly idea triggered some more silly ideas (although not as silly as mine), thus the team bet was born!

Team bet

The team bet is a bet that emerged inside the team. All ideas about a problem, which is investigated inside the sprint, are put on the whiteboard. People can come up with their own, or support the idea of another team member.
The only strange thing is that the winner (and his/her supporters) need to pay for the cake. Eternal fame is the price you win of course.

For a few days my silly idea was on the whiteboard with the sole support of the tester in the team.
Because it was the most silly idea on the board, the other ideas where investigated first.
But after a few days the only thing left was this silly idea. In the days that went by my feeling for this silly idea got stronger.

In my spare time I went through the code and pinpointed the strange behavior to the log rotate functionality.
Whenever a new logfile was created the oldest one needed to be removed to keep the directory clean.
But a few tests with a large amount of files showed that this was done in under 30 milliseconds on my local system.

Pigheaded me

Pigheaded as I am, I was still convinced that this was the problem. Because at this time everybody from the team was willing to prove me wrong we decided to
empty the logfile directory in the production environment (because that was the only place where the timeouts were experienced).

Before this action we experienced about 10 to 15 timeouts per hour. And yes you guessed right, after emptying the log directory 0 timeouts for a period of several weeks.
So this was beginning to sound like a duck, but to be sure it was a duck we needed to do a reverse test.
But hey! don't touch a good working system (because that is what we created by removing the large amount of files), so we weren't allowed to experiment anymore in our live environment.

And because all was running smooth regarding the transaction timeouts, our focus switched to the next most important issue.
To keep a little progress on the transaction timeout issue we asked the testers who do the acceptance testing to do this reverse test for us.

And after a day we got the final word: transaction timeouts were caused by the large amount of logfiles in the directory combined with a log rotate and cleanup of the oldest file.

End result of this quest was that me and my supporter took off to the nearest bakery to buy cake for the whole department.

And this is why the quest had cost me money and gave me eternal fame in return.
And this also makes me wonder about how free open source software actually is.....

Comments (5)

  1. Peter Veentjer - Reply

    February 14, 2011 at 11:42 pm

    He Chris,

    nice blogpost and fun to read 🙂

  2. [...] This post was mentioned on Twitter by Jan Plas, Kris Geusebroek. Kris Geusebroek said: Just blogged about how a transaction timeout quest did cost me money http://bit.ly/f3XXlJ [...]

  3. Wilco Koorn - Reply

    February 15, 2011 at 7:08 am

    He Kris,

    I am glad to be on your team, and am glad to be wrong 😉
    Excellent find!

  4. Friso van Vollenhoven - Reply

    February 21, 2011 at 10:07 am

    Is the logging dir on a automounted NFS? Doing a simple ls on those takes forever, so it would make sense that it takes very long to find out which file to delete...

Add a Comment