The "Performance Series" Part 1. Test Driven Performance.

Wilco Koorn

A number of my colleagues and I recently decided to share our knowledge about "performance" on this medium. You are now reading the first blog in the series, in which I present a test-driven approach to ensuring proper performance when we deliver our project.

Test driven

First of all, note that "test-driven" is (or should be 😉) common in the Java coding world. It is, however, usually applied at the unit-test level only: one writes a unit test that shows a particular feature is not (properly) implemented yet, so the test result is "red". Then one writes the code that "fixes" the test, so the test now succeeds and shows "green". Finally, one "refactors" the code to ensure aspects like maintainability and readability are met. This software development approach is known as "test-driven development" and is sometimes also referred to as "red-green-refactor".
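
As a minimal sketch of that cycle (the class and method names are purely illustrative, not from any real project): the test below is written first and fails, and the small class underneath is then added to make it pass, after which one refactors while keeping the test green.

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.assertEquals;

    // "Red": this test is written before PriceCalculator exists, so it fails
    // (it does not even compile yet).
    class PriceCalculatorTest {

        @Test
        void appliesTenPercentDiscount() {
            assertEquals(90.0, new PriceCalculator().discountedPrice(100.0, 0.10), 0.001);
        }
    }

    // "Green": the simplest implementation that makes the test pass.
    // "Refactor": improve names and structure afterwards while the test stays green.
    class PriceCalculator {
        double discountedPrice(double price, double discount) {
            return price * (1.0 - discount);
        }
    }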

Test driven performance

Now let us see what happens when we try to apply "test-driven" to a non-functional requirement like "performance". Obviously, we need a test, and the test result needs to be "red" or "green". There are many aspects to "performance", so let us take one for the sake of our story here: we assume we are building a web-based application and look at its response times. Our test can then be something like "the mean response time of the system when responding to URL such-and-such must be lower than 0.4 seconds". I personally find such a requirement highly interesting because it is time-related! These kinds of non-functional requirements are usually given for the final result of the project. But what about during the project?
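
A sketch of what such a response-time test could look like in plain Java, using only the JDK's HttpClient; the URL, the number of samples, and the 0.4-second criterion are illustrative and would come from the actual requirement:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ResponseTimeCheck {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/search"))  // hypothetical URL under test
                    .build();

            int samples = 50;
            long totalNanos = 0;
            for (int i = 0; i < samples; i++) {
                long start = System.nanoTime();
                client.send(request, HttpResponse.BodyHandlers.discarding());
                totalNanos += System.nanoTime() - start;
            }

            double meanSeconds = totalNanos / (double) samples / 1_000_000_000.0;
            // "Green" while the mean stays below the agreed criterion, "red" otherwise.
            System.out.printf("mean = %.3f s -> %s%n",
                    meanSeconds, meanSeconds < 0.4 ? "GREEN" : "RED");
        }
    }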

Test criteria during a project

My claim is that during a project the criteria for non-functional requirements should change over time. Response times should be extremely good at project start, as there is hardly any system at all! At the end of the project, when almost all development work is done, the response time only has to be "good enough". Therefore the criterion should be planned, for example by using a picture like this:

Figure 1. Planning a mean response time criterion during a project
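
One way to make such a planning executable is to record the agreed criterion per sprint and look it up whenever the performance test runs; the sprint numbers and thresholds below are purely illustrative, not values taken from the figure:

    import java.util.TreeMap;

    // Planned "green" criterion: the maximum allowed mean response time (in seconds)
    // per sprint. The values relax over the course of the project: strict at the start,
    // "good enough" (the real requirement) at the end.
    public class CriterionPlan {

        private final TreeMap<Integer, Double> maxMeanSeconds = new TreeMap<>();

        public CriterionPlan() {
            maxMeanSeconds.put(1, 0.10);   // hardly any system yet, so very strict
            maxMeanSeconds.put(5, 0.20);
            maxMeanSeconds.put(10, 0.30);
            maxMeanSeconds.put(15, 0.40);  // the final, delivered requirement
        }

        // The criterion of the day: the most recent planned value at or before this sprint.
        public double criterionFor(int sprint) {
            return maxMeanSeconds.floorEntry(Math.max(sprint, 1)).getValue();
        }
    }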

What happens when we "break the build"?

During development, we constantly run our test, for instance using a tool like JMeter. We collect mean response times of critical URLs and check whether we meet the criterion level of the day. One day we "break the build": we do not meet our criterion and the test is "red". Now what? For me this is even more intriguing than the flexible criteria we saw above. In test-driven software development one usually stops all development when the "build is broken": all tests must show green. In our case my strong advice is: don't act now, plan a performance tuning activity! During such an activity we tune the system until the test is "green" again. So a failing response time test triggers a planning activity rather than immediate action to fix the problem.
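
A sketch of such a daily check, assuming the JMeter results were saved as a CSV file with the default columns (where "elapsed" is the response time in milliseconds); the file name and the 0.25-second criterion of the day are illustrative:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.List;

    // Reads a JMeter CSV result file, computes the mean response time and compares it
    // with the criterion planned for today (e.g. looked up in the CriterionPlan sketch above).
    public class BuildGate {

        public static void main(String[] args) throws Exception {
            List<String> lines = Files.readAllLines(Path.of("results.jtl"));   // illustrative file name
            int elapsedColumn = Arrays.asList(lines.get(0).split(",")).indexOf("elapsed");

            double totalMillis = 0;
            for (String line : lines.subList(1, lines.size())) {
                totalMillis += Double.parseDouble(line.split(",")[elapsedColumn]);
            }

            double meanSeconds = totalMillis / (lines.size() - 1) / 1000.0;
            double criterion = 0.25;                                           // criterion of the day
            System.out.printf("mean = %.3f s, criterion = %.2f s -> %s%n",
                    meanSeconds, criterion, meanSeconds <= criterion ? "GREEN" : "RED");
        }
    }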

Preventing waste

Suppose we have planned a performance tuning activity because our test is "red". How much work do we have to do? How do we minimize the amount of work? In other words, how do we prevent waste? If we tune the system such that the test just shows "green", there is a good chance it turns "red" again next week and we have to plan another performance tuning activity. That does not make sense. On the other hand, if we optimize far beyond the "green" criterion, we do more work than necessary.

The solution is simple: use a lower limit! So when we do not meet the "green" criterion of, say, 0.2 seconds at a given time, we optimize until we have reached a 0.15-second response time and then stop optimizing. This leads to a performance planning like this:

Figure 2. Planning a mean response time during a project while preventing waste
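
Expressed in code, the build check and the tuning target then use two different limits; the 0.2- and 0.15-second values are the ones from the example above:

    public class TuningDecision {

        static final double GREEN_LIMIT = 0.20;  // criterion of the day, from the planning (seconds)
        static final double LOWER_LIMIT = 0.15;  // tuning target that prevents retuning next week

        // The daily check: "red" triggers planning a tuning activity, nothing more.
        static boolean buildIsGreen(double meanSeconds) {
            return meanSeconds < GREEN_LIMIT;
        }

        // Used only inside a planned tuning activity: keep optimizing until this returns true,
        // then stop, so we neither stop too early nor optimize further than needed.
        static boolean tuningTargetReached(double meanSeconds) {
            return meanSeconds < LOWER_LIMIT;
        }
    }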

Test driven performance in an Agile perspective

Of course the initial performance-planning figure is a very rough guess. There is nothing wrong with such a guess! It is the best estimate we have at that moment, and during the project we adapt our performance planning as we learn more. The key thing is that we constantly attend to system response time, as we always have a test at hand showing us "red" or "green".

Pros and cons

There are two major advantages to the approach sketched above. First, we catch poor design decisions that lead to bad response times at an early stage. Project management therefore stays in control of a major project risk, as we are no longer confronted with a badly performing system late in the project. Second, the lower limit prevents waste during optimization.

As a possible disadvantage, our approach may well be more expensive than one where we only inspect the behavior of the system in production and rely on quick reactions to fix any issues. My colleague Adriaan Thomas will zoom in on this aspect in the next blog of this series.

Comments (1)

  1. Dave Collier-Brown

    October 11, 2012 at 5:59 pm

    My experience is that doing performance tests as part of the normal
    functional testing pays off in the same way writing traditional tests
    does. As soon as you're green, you stop trying to make it faster.

    Actually executing the tests is quite easy: you need two, both done with
    something like JMeter. One tests for code-path speed, the second for
    scalability.

    Let's take a simple example: a web service that has to return an answer
    in 1/10 of a second for any load up to 10 users on the one-processor
    wimpy little machine that we do our nightly build on. The 0.1-second
    response time will be your "red" line value for production.

    We set a budget of, for example, 0.08 seconds for the middleware and the
    database back-end, and initially write a mock-up for the middleware that
    waits 0.08 seconds and then returns "success".

    We set up a JMeter script that sends a series of single requests to the
    UI from a single user, averaging one per second, and look at all but the
    first few samples. That's the number to plot on your diagram and
    compare with the red and green lines. In this case, the production red
    line would be at 0.1 second, and we'd watch out for exceeding it, and
    also for trends that suggest we're going to exceed it in the next
    sprint. Either is a hint to schedule some profiling and refactoring.

    The second test is for scalability. Programs under load start off fast,
    stay pretty fast under increasing load, and then suddenly get slower and
    slower, as soon as you exceed some particular load.

    If you draw a chart of response time versus load of the program we're
    describing, it will start off almost horizontal for one or two users,
    creep up a bit more until you get to eight or so, and then start rising
    (getting slower) very quickly. If you keep increasing the load in users,
    you'll find it turns into an almost straight line going up at perhaps 45
    degrees, forever. It looks like a hockey-stick: a short horizontal
    blade, a curve upwards and a long, straight handle. The curve is
    actually a hyperbola drawn between a horizontal and a slanted line.

    We want to measure the amount it slows down under load as it approaches
    and passes our target number of users, so we set up JMeter to run with
    increasing numbers of users until we're well past the target load.
    We'll probably run from 1 to 15, and plot that. If the program isn't
    scaling well under load, the curve will start curling upwards early, and
    exceed 1/10 of a second well before we reach 10 users.

    If it does, we have a bottleneck, and we need to plan to do two things
    in the next sprint: check that our algorithm is supposed to scale, and
    find the slowest part of the program. If we have a bad algorithm, like
    bubble sort, then we'd better change it. Otherwise we profile the
    program and find out what's slowing us down.

    If it doesn't degrade much, you know you don't have to do anything, and
    can even choose to spend some of your time budget on slowish features.

    At some point in the development, it is wise to do a little bit of
    bottleneck-hunting, but not too much. We're not optimizing, because that
    is specializing the program for some particular use case, just removing
    performance bugs. Removing bugs is always a good idea, done in moderation.

    If you're too successful at improving performance, you may need to use a
    trick that dates back to Multics: put a timer in the UI that keeps it
    from returning until at least 0.05 seconds have elapsed. That keeps
    users from expecting amazingly fast results all the time, just because
    they've seen them a few times when everything was quiet.

    When the program is approaching shippable, the load/performance curve
    we've been measuring will be the first part of the capacity planning
    effort for the program. Unlike estimates based on CPU/memory/IOPs and
    the like, estimates based on time and measured load are usable for
    estimating the performance of similar but larger machines.
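
For readers who want to try the budgeted mock-up the commenter describes, a minimal sketch using only the JDK's built-in HTTP server could look like the code below; the port and path are illustrative, and the 0.08-second wait is the back-end budget from the comment.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    // Stand-in for the middleware and database back-end: it does nothing but consume
    // its 0.08-second time budget and then report success, so the UI layer can be
    // performance-tested against a predictable back-end.
    public class MockMiddleware {

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);  // illustrative port
            server.createContext("/service", exchange -> {                          // illustrative path
                try {
                    Thread.sleep(80);  // the 0.08-second budget
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "success".getBytes();
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
        }
    }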
