Dealing with emergencies in Agile teams

Serge Beaumont

Every Agile team has to deal with whatever they've put out in the wild next to their "regular" work. How to handle the - by definition - unknown load of production emergencies when you're trying to achieve a stable pace? You can deal with emergencies by performing triage to either reject, defer or accept. You can set up a buffer to absorb some of the uncertainty, and finally you should make sure that you take the time to reduce the number of emergencies by building quality in. If you find you are mostly doing maintenance, you can consider doing Kanban.

The Context

In ye olden days of waterfall projects I never had to deal with that horror of horrors, maintenance. I'd be part of a team building something new, and you could keep going on until the end of the project. It was the maintenance department that would have to deal with the nonsense I had created. Ah, those were the days... all the fun without the hangover afterwards 🙂

But Agile teams, or in fact any team that starts delivering early and often (in my later waterfall days I'd already started to figure out that maintenance pretty much starts after the first two weeks... 🙂), deliver long before it's even possible to hand the project over - if at all. The nature of frequent delivery means that the team itself has to deal with all issues that arise. The first reason is that they are the ones who can do it, the second is that you want to integrate fixes into the team's work anyway: they still need to deliver new versions of that same software...

In my consultancy work I've seen this issue come up with every single Agile team I've known, so this is not a unique situation for a small number of teams. All Agile teams have to learn how to deal with this issue!

The Problem

People running in with emergencies

In a Scrum team, the problem will generally surface after one or more Sprints where a number of "production incidents" or similar unplanned mayhem took up so much of the team's time that they did not achieve their planned Sprint goal. The result is that a team has a hard time planning for the next Sprints. The first problem is that they do not know their "real" Velocity, the second is that they have to somehow factor in the - by definition unpredictable - production incidents.

But watch out, there is a pitfall hidden in the above paragraph. Predictability is not the end goal in Agile! Predictability is important to know when a release is shipped, and to know how to pace the team. But I've seen too many cases where teams try to "predict harder" when they should be adapting better. When dealing with the unpredictable, the focus should be on adaptation first, not on more planning beforehand. That would be a return to The Way Of The Waterfall...

The Goal

So there we have it: the goal is to be able to absorb a reasonable amount of uncertainty, striking a balance between robustness and speed.

The Solutions

Before I present some solutions, let me state this right away: if the amount of work from unplanned production incidents is significant compared to the "regular" work, there is no way you can achieve sufficient stability. You'll need to fix the root causes of all those production issues first. More on that later.

Solution 1: Perform Triage - and Reject

The first thing to check is if you want to fix that production issue at all. This is not as silly as it might seem at first. There are so many cases where a production emergency is not an emergency at all, and should not even have been brought in in the first place! Some examples of "noncidents":

  • Sales storms in with "the deal of the century": "If we get feature X in NOW, we can win over customer X!". In my experience this is always due to an uneducated and undisciplined Sales department. The root cause here is that Sales promised things they shouldn't have, and they need to save their own skin now. It is ALWAYS possible to wait two weeks for a new feature.
  • Some stakeholders "upgrade" normal requests to production emergencies in an attempt to bypass the negotiations around the backlog. "It's a blocking issue that I can't get that feature!". "Oh? Did the system crash? Is something not working?". "Well... no, but it's a real blocker for my work!". That stakeholder may have a genuine need, but that does not make it a production emergency.

So solution 1 is: a strong Product Owner who performs triage on all production issues. If it's a real production issue then by all means fix it. But I can guarantee that you'll find a good number of issues that should not be emergencies at all... BTW, a Product Owner performing triage in this way is what James Coplien calls a Firewall in his organizational patterns book.

Solution 2: Perform Triage - and Defer the fix until at least the next Sprint

"We found this really big problem! We need it fixed right now!". "Sure, we'll get right on it. How long has this issue been in the system?". "Well, for over a year, but we just found out about it!". "It's been in there for a year? ...And you can't wait two more weeks for a fix?"

Solution 2 is an extension of Solution 1. An emergency might indeed be important to fix, but there's an important criterion to an emergency: it's only an emergency if it must be fixed in the current Sprint. If you can defer the problem to next Sprint, there is no problem! The team can pick it up as part of their regular process, plan it, build it, and deliver at the end of next Sprint. Again this is a Product Owner responsibility: next to the decision to reject, a good Product Owner will make sure that everything that can be deferred will be.
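
The triage from Solutions 1 and 2 can be sketched in a few lines of Python. This is only an illustration: the Request fields and the two yes/no questions are my own simplification of the examples above (is something actually broken, and can it wait until next Sprint?), not a prescribed checklist. A real Product Owner will weigh more than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT = "reject"   # not an emergency: goes through normal backlog negotiation
    DEFER = "defer"     # real issue, but it can be planned into the next Sprint
    ACCEPT = "accept"   # real issue that must be fixed in the current Sprint


@dataclass
class Request:
    # Hypothetical fields, chosen to mirror the questions asked in the text.
    title: str
    something_is_broken: bool          # "Did the system crash? Is something not working?"
    can_wait_until_next_sprint: bool   # "...And you can't wait two more weeks for a fix?"


def triage(request: Request) -> Decision:
    """Reject, defer or accept an incoming 'emergency'."""
    # Solution 1: a feature request dressed up as an emergency is a "noncident".
    if not request.something_is_broken:
        return Decision.REJECT
    # Solution 2: it is only an emergency if it must be fixed in the current Sprint.
    if request.can_wait_until_next_sprint:
        return Decision.DEFER
    return Decision.ACCEPT
```

Only what comes out as ACCEPT should ever touch the current Sprint; everything else stays in the regular process.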

Solution 3: Reserve a buffer to deal with unexpected issues

If you've done Solutions 1 and 2, whatever you're left with should be real issues that you have to fix as soon as possible. The best way I know to deal with this is to reserve a buffer of time or story points that is left unplanned. This works especially well if the historical workload of any issues coming up is reasonably stable. You do not know what you'll be doing, but you know how much effort it will take.

Watch out though, using a buffer can blow up in your face! The first danger is the size of the buffer. If the buffer is a significant percentage of the Sprint, say more than 1/5 of your velocity, then you'll end up with a big hole in your planning process. So follow Buffer Rule 1: the buffer is not for backlog items. Try to keep the buffer as small as possible.

The second danger with using buffers is what I already discussed in Solution 1: the moment your stakeholders smell a workaround in the regular process, you can be sure they'll dive onto it. A buffer really, really needs to be protected from unintended use. So perform good triage!

The third danger is buffer overflow. Just like in a computer this leads to blowing up the process. If the buffer is used, you'll need to track how much of the buffer has been used, otherwise you'll be in for a surprise at the end of the Sprint.
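
As a rough sketch of what tracking the buffer could look like, here is a small Python class. The class and its names are made up for this example, and the 1/5 threshold is just the rule of thumb mentioned above, not a hard limit from any framework.

```python
class SprintBuffer:
    """Tracks an unplanned-work buffer, measured in story points."""

    def __init__(self, velocity: int, size: int) -> None:
        self.velocity = velocity  # points the team normally completes per Sprint
        self.size = size          # points left unplanned for real emergencies
        self.used = 0             # points spent on emergencies so far

    @property
    def too_large(self) -> bool:
        # Rule of thumb from the text: more than 1/5 of velocity is a warning sign.
        return self.size * 5 > self.velocity

    def spend(self, points: int) -> None:
        """Record emergency work done against the buffer."""
        self.used += points

    @property
    def remaining(self) -> int:
        return self.size - self.used

    @property
    def overflowed(self) -> bool:
        return self.used > self.size


# Usage: a team with velocity 30 reserves 5 points.
buffer = SprintBuffer(velocity=30, size=5)
buffer.spend(3)
print(buffer.remaining, buffer.overflowed)
```

The point of keeping the used amount visible, for instance on the Scrum board, is that nobody is surprised at the end of the Sprint.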

Solution 4: Fix root causes, improve quality

This solution is presented as number 4 because the first three are in logical order when you're trying to control the damage, but in the end you'll want to do the most important thing of all: fix issues so they stay fixed, and build in quality so that you don't have emergencies at all! Now this is something we should be doing anyway, and it is not unique to Agile projects: you want to do this in any project! But there is an extra Buffer Rule that is relevant in this respect (credit goes to Jeff Sutherland on this one, I learned this rule when we were doing CSM trainings). Buffer Rule 2: If you overflow the buffer, abort the Sprint. If your issues are such that you cannot even keep emergency work limited to a small buffer, you have no business trying to make progress building in features. Abort, use the Sprint to fix underlying root causes, and try again next Sprint. Coincidentally, Buffer Rule 2 also works wonders for all those stakeholders trying to "upgrade" their own agenda: "Do you really want that issue fixed now? The team estimates that this is two points of work, and this would overflow the buffer. We would have to abort the Sprint, and you also would not get those other user stories you asked for!". "Oh... um. Well, I guess it isn't that much of a problem..." (And it wasn't... Real story!).
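
Buffer Rule 2 boils down to one comparison, which you can put in front of a stakeholder before the work is accepted. The function below is a hypothetical sketch in Python; the name and the returned strings are my own, and the numbers in the usage lines are invented to mirror the "two points of work" conversation above.

```python
def consequence_of_accepting(buffer_size: int, buffer_used: int, estimate: int) -> str:
    """Say what happens to the Sprint if this emergency is accepted now."""
    # Buffer Rule 2: if the buffer overflows, the Sprint is aborted.
    if buffer_used + estimate > buffer_size:
        return "abort sprint"
    return "fix within buffer"


# A 5-point buffer with 4 points already used: 2 more points would overflow it.
print(consequence_of_accepting(buffer_size=5, buffer_used=4, estimate=2))
# The same request early in the Sprint still fits.
print(consequence_of_accepting(buffer_size=5, buffer_used=1, estimate=2))
```

Asking the question before accepting the work is what gives the rule its teeth: the stakeholder sees the price of the "emergency" up front.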

Extra: Size the team right

Team size is not a central focus in dealing with emergencies, but it is a factor to be aware of. A small team performs better because it has less overhead, but it is less robust against losing members, whether through illness or through something that pulls a team member away like... production emergencies maybe? On a 10 person team losing one person "only" means a hit of about 10% in productivity (this is a simplified calculation of course: it assumes all team members are totally replaceable at a moment's notice), while in a three person team losing that same person would already mean a whopping 33%! The sweet spot tends to be around 7-9 people: small enough to reduce overhead, large enough to absorb some production loss.
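
The simplified calculation above is just one divided by the team size. A tiny Python sketch, under the same unrealistic assumption that every team member is fully interchangeable (the function name is made up for this example):

```python
def capacity_loss(team_size: int, people_lost: int = 1) -> float:
    """Fraction of capacity lost, assuming all members are interchangeable."""
    return people_lost / team_size


for size in (3, 7, 9, 10):
    print(f"{size:>2} people: {capacity_loss(size):.0%} lost")

# Prints:
#  3 people: 33% lost
#  7 people: 14% lost
#  9 people: 11% lost
# 10 people: 10% lost
```

In the 7-9 range a single person pulled away by an emergency costs roughly 11% to 14%, which a modest buffer can still absorb.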

And finally... consider using Kanban instead of Scrum

If you find that your team is doing more maintenance than "new stuff", you might consider using Kanban instead. This is because the granularity of Kanban is stories, not Sprints. If there is a production emergency, there is already an intrinsically shorter wait for it to be picked up because of this. Kanban is about flow, while Scrum is about iterations. The two styles are close enough that I've seen a Scrum team transition into "flow mode" when they scaled down and only did maintenance, and go back to Scrum when a new release was planned and they scaled up again.

In Conclusion

Every Agile team has to deal with whatever they've put out in the wild next to their "regular" work. You can deal with emergencies by performing triage to either reject, defer or accept. You can set up a buffer to absorb some of the uncertainty, and finally you should make sure that you take the time to reduce the number of emergencies by building quality in. If you find you are mostly doing maintenance, you can consider doing Kanban.

Comments (18)

  1. Martien van Steenbergen

    February 28, 2011 at 11:07 am

    Wonderful article, Serge. Really love your sketches. They give a personal touch as well as clear things up.

    Wish you well.

    P.S. My wish is too, that your article gets retrofitted into the Scrum pattern language…

  2. Michael Sahota

    February 28, 2011 at 4:46 pm

    Serge, great post. I love the drawings and thorough explanation.

    I would even go a step farther and argue that "no interruptions" in Scrum is an anti-pattern -

  3. Jason Fair

    February 28, 2011 at 10:23 pm

    Great article. Agree with your recommendations. I specialize in Agile in ERP, and including "unplanned events" in the sprint is imperative to being able to manage expectations with stakeholders as well as deal with integration and dependencies that are inherent with ERP systems.


  4. Jarl Meijer

    March 2, 2011 at 1:27 pm

    I really like this overview, and the cartoons!
    In my experience a simple question can help to reduce the number of emergency calls in many organisations: "Does the team need to solve this problem, or can someone else do it as well?". This certainly holds for non-coding issues like analysis of a problem ("Why did this client receive only 5 transactions yesterday?"), configuration, or other. I often see issues being dropped too easily at the one-who-made-it, the-one-who-is-our-hero or the-team-who-is-doing-many-other-things-for-me. Sometimes a little instruction or an extra authorisation can keep work outside the team, which is a special case of solution number 4.

  5. Jenna Pederson

    March 10, 2011 at 5:39 am

    Great post! This has been something I've been struggling with for a while and have been thinking about this week.

    Another gotcha to using a buffer, and maybe this is what you were touching on, is that if stakeholders get wind of there being a buffer for "unplanned" work and it's even occasionally not all used up, they will expect that you can just deliver more stories instead. There can be some expectation that everything is planned and anything unplanned is not adding value or not growing revenue. This definitely comes back to setting expectations and "protecting" the buffer.

  6. Evelijn van Leeuwen

    March 14, 2011 at 8:46 pm

    Serge, I do like your post!

  7. Thomas Quaidoo

    March 16, 2011 at 11:30 pm

    Serge I love your article and how it addresses this issue of maintenance, the various elements that may be in play, and the variety of solutions offered. All agile teams employ their own unique flavors and so such a comprehensive analysis and recommendation is extremely helpful.

  8. Rob Watson

    March 17, 2011 at 2:37 pm

    Great article - your points on the true definition of an emergency and the key role of a strong product owner are particularly well made.

    Personally I wouldn't use a buffer, for the reasons stated, but I always like to get the product owner to use strict MoSCoW prioritisation for the stories they expect to be delivered in the sprint. You plan the sprint to include all of the "must haves", and some of the "should haves". You can still deliver a working product as long as you have all of the "musts". The "should haves" then become your buffer, and it's entirely up to the product owner to decide whether the "emergency" is more or less important than a "should have." If it's more important than a must have, then by definition the sprint is aborted.

  9. Fabrice Aimetti

    May 25, 2011 at 6:06 pm

    Hello Serge,
    This post is very interesting. I've translated it into French:
    Faire face aux urgences dans les équipes Agile


  10. Serge Beaumont

    May 26, 2011 at 12:15 pm

    Cool Fabrice, thanks for that. Despite my French name my command of the French language is not up to that task... 🙂

  11. Serge Beaumont

    May 26, 2011 at 1:04 pm

    @Rob, good point. Basically you designate the bottom stories as being swappable for an emergency. It's an implicit buffer mechanism: I like it. Thanks for that. All these tricks we have are dependent on context, and it's nice to have a broad range of options to choose from! 😎

  12. [...] in a bad way. It’s a challenge that a lot of teams face: unplanned change. The blog post dealing with emergencies explains the problem and possible solutions very [...]

  13. Jay

    July 30, 2015 at 11:37 pm

    Great article! And timely.

  14. Robert

    August 10, 2015 at 10:03 pm

    Another question this article provokes: How often to perform a triage? If it's truly an emergency issue, do you call ad-hoc triage meetings the moment you hear about these supposed "emergency issues"? What is the communication workflow like for this?

    Where I work, it's normally product management that comes to our team directly and says "We have a hot issue!". Then it's just a matter of finding someone to work on it. It's pretty direct to development that way. However, we plan buckets for these unpredictable "hot" issues. Normally they come from UAT or production field issues.

    • Serge Beaumont

      August 25, 2015 at 11:33 am

      Hello Robert,

      Sorry for the late reply: holidays and all that.

      The answer to "how often to perform a triage" is "all the time (within practical boundaries)". The role of the PO must be present continuously because the world does not work in neat iterations. Stuff comes in all the time. So the triage "function" should be ready to perform on a moment's notice. However, in practice it becomes impractical if a Product Owner would have to be present to be the go-between for these things all the time. So we delegate as much as we can to the team.

      I would suggest that you make an agreement between the team and the PO about levels of emergencies and how to respond to them. Yes, this basically means stealing 90% of any Service Level Agreement you have. For each of these levels you decide with the team how to respond. Here is a semi-made up example of a semi-hypothetical team:

      • Blocking issues that are visible in the UI: must be handled by PO.
      • Blocking issues that are not visible in the UI: team tech lead decides to place within buffer or not. Inform PO.
      • Non-blocking but high priority issues: discuss in bug triage meeting every Wednesday, with whole team (which means including PO btw).
      ... and so on. You'll probably get the idea.

      Emergencies that need sub-hour response times, that will always be an ad-hoc thing, but the next level down would be to put it on the Scrum Board on the Fast Lane, and discuss it during standup. Daily tends to be more than good enough for 99% of the cases.

      With respect to your relation to product management, it's good to be responsive, but I'd definitely keep an eye on the true "hotness" of issues. Many of them are only "hot" because the person who comes in waited too long: this is artificial "hotness". Recognizing these cases is a good subject for a retrospective.

      I recently gave a talk at Xebicon 2015 where I also touched on the subject. You can find the video and slides here: Look under "Videos and Slides" for the "24 man Devops Team" presentation.

      Hope that helps!

  15. sheetal

    February 20, 2016 at 6:40 am

    Hi Serge,
    Great write-up, it was really helpful. Can I ask an extended question: how do you return to normalcy, or even plan to return to normal planning? It can never happen all at once, it has to be gradual I suppose.

    • Serge Beaumont

      February 21, 2016 at 11:36 pm

      Hey Sheetal,

      The buffer trick is meant to absorb a reasonable amount of "abnormalcy" without disrupting the overall cadence of a team. If you are in a situation where you can not even have a normal planning with the use of the buffer I can only conclude that the buffer has overflowed. The strict rule here, as professed by Jeff Sutherland, is to immediately abort the Sprint on an overflow and replan the remainder. The reason is that your Sprint has been disrupted so much that the original Sprint plan is beyond rescuing.

      The second thing that must be done is to address the top reason for the buffer overflow as the top impediment, and put the improvement as the top item of your backlog. This way of working is very much inspired by the "Hold the Line"/Andon Cord of Lean: whenever a station in a production line threatens to deliver a problem to a next station (note the buffer here is "as long as the problem stays within your own station"), someone will pull the Andon cord and make the WHOLE production line stop, and everybody dives on top of the problem.

      Now here is the most difficult part: the environment must allow this "holding the line", which is often not the case because there is so much pressure to just keep on going, however broken the situation is. The "hold the line" mentality believes that stopping, analyzing and fixing a problem gains you far more speed, especially if you keep it up over a longer period, than the stop itself costs you. I often tell people I coach that they can just as well throw out the retrospective if they never do anything with the findings, which is a similar idea.

      Finally I would like to suggest that you look at the A3 process (also a Lean thing, Scrum does not have any explicit improvement tools) to tackle the impediments that are so large that they need a whole "improvement project" to fix them. The standard work on this is considered to be "The A3 Process" by Durward K. Sobek II.

      Hope that helps, good luck!
