Tales from the SRE trenches: Dev vs Ops

This is the second part in a series of articles about SRE, based on the talk I gave in the Romanian Association for Better Software.

On the first part, I introduced briefly what is SRE. Today, I present some concrete ways in which SRE tried to make things better, by stopping the war between developers and SysAdmins.

Dev vs Ops: the eternal battle

So, it starts at looking at the problem: how to increase the reliability of the service? It turns out that some of the biggest sources of outages are new launches: a new feature that seemed innocuous somehow managed to bring the whole site down.

Devs want to launch, and Ops want to have a quiet weekend, and this is were the struggle begins. When launches are problematic, bureaucracy is put in place to minimise the risks: launch reviews, checklists, long-lived canaries. This is followed by development teams finding ways of side-stepping those hurdles. Nobody is happy.

One of the key aspects of SRE is to avoid this conflict completely, by changing the incentives, so these pressures between development and operations disappear. At Google, they achieve this with a few different strategies:

Have an SLA for your service

Before any service can be supported by SRE, it has to be determined what is the level of availability that it must achieve to make the users and the company happy: this is called the Service Level Agreement (SLA).

The SLA will define how availability is measured (for example, percentage of queries handled successfully in less than 50ms during the last quarter), and what is the minimum acceptable value for it (the Service Level Objective, or SLO). Note that this is a product decision, not a technical one.

This number is very important for an SRE team and its relationship with the developers. It is not taken lightly, and it should be measured and enforced rigorously (more on that later).

Only a few things on earth really require 100% availability (pacemakers, for example), and achieving really high availability is very costly. Most of us are dealing with more mundane affairs, and in the case of websites, there are many other things that fail pretty often: network glitches, OS freezes, browsers being slow, etc.

So an SLO of 100% is almost never a good idea, and in most cases it is impossible to reach. In places like Google an SLO of "five nines" (99.999%) is not uncommon, and this means that the service can't fail completely for more than 5 minutes across a whole year!

Measure and report performance against SLA/SLO

Once you have a defined SLA and SLO, it is very important that these are monitored accurately and reported constantly. If you wait for the end of the quarter to produce a hand-made report, the SLA is useless, as you only know you broke it when it is too late.

You need automated and accurate monitoring of your service level, and this means that the SLA has to be concrete and actionable. Fuzzy requirements that can't be measured are just a waste of time.

This is a very important tool for SRE, as it allows to see the progression of the service over time, detect capacity issues before they become outages, and at the same time show how much downtime can be taken without breaking the SLA. Which brings us to one core aspect of SRE:

Use error budgets and gate launches on them

If SLO is the minimum rate of availability, then the result of calculating 1 - SLO is what fraction of the time a service can fail without failing out of the SLA. This is called an error budget, and you get to use it the way you want it.

If the service is flaky (e.g. it fails consistently 1 of every 10000 requests), most of that budget is just wasted and you won't have any margin for launching riskier changes.

On the other hand, a stable service that does not eat the budget away gives you the chance to bet part of it on releasing more often, and getting your new features quicker to the user.

The moment the error budget is spent, no more launches are allowed until the average goes back out of the red.

Once everyone can see how the service is performing against this agreed contract, many of the traditional sources of conflict between development and operations just disappear.

If the service is working as intended, then SRE does not need to interfere on new feature launches: SRE trusts the developers' judgement. Instead of stopping a launch because it seems risky or under-tested, there are hard numbers that take the decisions for you.

Traditionally, Devs get frustrated when they want to release, but Ops won't accept it. Ops thinks there will be problems, but it is difficult to back this feeling with hard data. This fuels resentment and distrust, and management is never pleased. Using error budgets based on already established SLAs means there is nobody to get upset at: SRE does not need to play bad cop, and SWE is free to innovate as much as they want, as long as things don't break.

At the same time, this provides a strong incentive for developers to avoid risking their budget in poorly-prepared launches, to perform staged deployments, and to make sure the error budget is not wasted by recurrent issues.

That's all for today. The next article will continue delving into how traditional tensions between Devs and Ops are played in the SRE world.