ERROR BUDGET

What level of reliability do you need from an application?  The SRE view is that you will never achieve 100%, so the question is what figure is realistic.  The target is often expressed as a number of 9s, e.g. four 9s is 99.99%.

The error budget is what’s left over.  If we need a reliability of 99.99% then we have an available error budget of 0.01%.  Keeping it simple, let’s imagine that we are using the availability of an application as the measure of reliability.  An error budget of 0.01% represents about 4.3 minutes of outage in a 30-day month.
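
The arithmetic behind that figure can be sketched in a few lines of Python.  This is a minimal illustration, assuming a 30-day month and availability as the only measure of reliability; the function name error_budget_minutes is made up for this example.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for a period of `days` days.

    Assumes availability is the only reliability measure and that the
    period is a fixed number of days (30 by default).
    """
    total_minutes = days * 24 * 60           # 43,200 minutes in 30 days
    budget_fraction = (100.0 - slo_percent) / 100.0
    return total_minutes * budget_fraction


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% target: {error_budget_minutes(slo):.2f} minutes per 30 days")
```

Each extra 9 cuts the budget by a factor of ten: roughly 432 minutes at 99%, 43 minutes at 99.9% and 4.3 minutes at 99.99%.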

The reasoning goes like this.  As long as the service isn’t out for longer than 4.3 minutes in the month, we are within our error budget.  If we run out of error budget then there is a freeze on the service, i.e. no new releases other than fixes that get us back inside the error budget.
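
As a rough sketch, the freeze rule can be written as a simple check.  It assumes the 4.3 minute monthly budget from the 99.99% example and a list of outage durations in minutes; release_allowed and its parameters are hypothetical names, not part of any real tool.

```python
from typing import Iterable

# Assumed budget: 0.01% of a 30-day month, in minutes.
MONTHLY_BUDGET_MINUTES = 4.32


def release_allowed(outage_minutes: Iterable[float],
                    budget_minutes: float = MONTHLY_BUDGET_MINUTES) -> bool:
    """Return True while the month's total downtime is within the budget.

    Once the total exceeds the budget the service is frozen, so only
    fixes that restore reliability should go out.
    """
    return sum(outage_minutes) <= budget_minutes


if __name__ == "__main__":
    print(release_allowed([1.0, 2.0]))   # 3 minutes used: releases continue
    print(release_allowed([3.0, 2.0]))   # 5 minutes used: freeze
```

In practice the outage figures would come from monitoring, and the check would gate the release pipeline rather than print a value.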

So why fix any long-running problems if we are still within the error budget?  Well, imagine you have a problem that causes a 2 minute outage every month.  You are wasting nearly half your error budget.  That means less opportunity for change and a greater chance of a freeze when additional problems occur.
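
To put numbers on that, here is a small sketch that works out how much of the budget a recurring outage consumes.  It assumes the same 4.32 minute monthly budget (99.99% over 30 days); budget_usage is a name invented for this example.

```python
def budget_usage(recurring_minutes: float, budget_minutes: float = 4.32):
    """Return (share of budget consumed, minutes left) for a recurring outage.

    The default budget is an assumption: 0.01% of a 30-day month.
    """
    share = recurring_minutes / budget_minutes
    remaining = budget_minutes - recurring_minutes
    return share, remaining


if __name__ == "__main__":
    share, remaining = budget_usage(2.0)
    print(f"{share:.0%} of the budget used, {remaining:.2f} minutes left")
    # prints: 46% of the budget used, 2.32 minutes left
```

With only 2.32 minutes left, a single additional incident of a few minutes is enough to trigger a freeze.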

You can see how this encourages developers to deliver better code and fix bugs quickly.  It’s in their interest to do so or they won’t be able to deploy new releases.