Pining for the good ol' days

Today, Norbert Preining wrote a blog post, syndicated on Planet Debian, about Sarah Sharp quitting Linux development, drawing a parallel with the Debian community.

Once again, he is complaining that the fun has gone out of Debian because making sexist jokes or treating other people like shit is not allowed any more. He seems to think the LKML is the ideal environment and that Debian should be more like it.

He also celebrates that Ms Sharp[^SJW][^SJW2] left Linux, because evidently the freedom to be a jerk to each other is more important than the contributions that she -or any other person who finds that atmosphere toxic, like mjg- has produced or could have produced in the future.

[^SJW]: Whom he calls "one more SJW". I guess, then, that after all this is just about ethics in FOSS development.

[^SJW2]: At this point, using the term SJW in a discussion should count as a Godwin, and mean that you have lost the argument.


I would like to tell readers of Planet Debian that this view is not the norm in Debian any more. We are working hard to make the project more inclusive, fun, and welcoming to everybody.

Sadly, some people are still pining for the good ol' days, and won't give up so easily on their privilege.

Tales from the SRE trenches - Part 1

A few weeks ago, I was offered the opportunity to give a guest talk at the Romanian Association for Better Software.

RABS is a group of people interested in improving the trade, and it regularly holds events where invited speakers give presentations on a wide array of topics. The speakers are usually pretty high-profile, so this is quite a responsibility! To make things more interesting, much of the target audience works on enterprise software, on Windows platforms. Definitely outside my comfort zone!

Considering all this, we decided the topic was going to be Site Reliability Engineering (SRE), concentrating on some aspects of it which I believe can be useful regardless of the kind of company you work for.

I finally gave the talk last Monday, and the audience seemed to enjoy it, so I am going to post my notes here; hopefully other people will like them too.

Why should I care?

I prepared this talk with an audience of software engineers in mind, so why would anyone want to hear about an idea that seems to be only about making life better for the operations people?

The thing is, having your work as a development team supported by an SRE team will also benefit you. This is not about empowering Ops to hit you harder when things blow apart, but about having a team that is your partner. A partner that will help you grow, handle the complexities of a production environment so you can concentrate on cool features, and get out of the way when things are running fine.

A development team may seem to only care about adding features that will drive more and more users to your service. But an unreliable service is a service that loses users, so you should care about reliability. And what better way than to have a team with Reliability in its name?

What is SRE?

SRE means Site Reliability Engineering, Reliability Engineering applied to "sites". Wikipedia defines Reliability Engineering as:

[..] engineering that emphasizes dependability in the life-cycle management of a product.

This is, historically, a branch of engineering that made it possible to build devices that work as expected even when their components are inherently unreliable. It focuses on improving component reliability, establishing minimum requirements and expectations, and making heavy use of statistics to predict failures and understand underlying problems.

SRE started as a concept at Google about 12 years ago, when Ben Treynor joined the company and created the SRE team from a group of 7 production engineers. There is no canonical definition of what Site Reliability Engineering means; while the term and some of its ideas are clearly inspired by the more traditional RE, Treynor defines SRE with these words[^BT1]:

[^BT1]: http://www.site-reliability-engineering.info/2014/04/what-is-site-reliability-engineering.html

Fundamentally, it's what happens when you ask a software engineer to design an operations function.

Only hire coders

After reading that quote it is not surprising that the first item in the SRE checklist[^checklist] is to only hire people who can code properly for SRE roles. Writing software is a key part of being an SRE. But this does not mean that there is no separation between development and operations, nor that SRE is just a fancy(er) name for DevOps[^DevOps].

[^checklist]: SRE checklist extracted from Treynor's talk at SREcon14: https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre

[^DevOps]: By the way, I am still not sure what DevOps means; it seems that everyone has a different definition for it.

It means treating operations as a software engineering problem, using software to solve problems that used to be solved by hand, implementing rigorous testing and code reviewing, and taking decisions based on data, not just hunches.

It also implies that SREs can understand the product they are supporting, and that there is a common ground and respect between SREs and software engineers (SWEs).

There are many things that make SRE what it is, some of these only make sense within a special kind of company like Google: many different development and operations teams, service growth that can't be matched by hiring, and more importantly, firm commitment from top management to implement these drastic rules.

Therefore, my focus here is not to preach on how everybody should adopt SRE, but to extract some of the most useful ideas that can be applied in a wider array of situations. Nevertheless, I will first try to give an overview of how SRE works at Google.


That's it for today. In the next post I will talk about how to end the war between developers and SysAdmins. Stay tuned!

All the articles in the series: part 2, part 3, part 4, and part 5.

Tales from the SRE trenches: Dev vs Ops

This is the second part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.

In the first part, I briefly introduced what SRE is. Today, I present some concrete ways in which SRE tries to make things better by stopping the war between developers and SysAdmins.


Dev vs Ops: the eternal battle

So, it starts by looking at the problem: how to increase the reliability of the service? It turns out that some of the biggest sources of outages are new launches: a new feature that seemed innocuous somehow managed to bring the whole site down.

Devs want to launch, and Ops want to have a quiet weekend, and this is where the struggle begins. When launches are problematic, bureaucracy is put in place to minimise the risks: launch reviews, checklists, long-lived canaries. This is followed by development teams finding ways of side-stepping those hurdles. Nobody is happy.

One of the key aspects of SRE is to avoid this conflict completely, by changing the incentives, so these pressures between development and operations disappear. At Google, they achieve this with a few different strategies:

Have an SLA for your service

Before any service can be supported by SRE, it has to be determined what level of availability it must achieve to make the users and the company happy: this is called the Service Level Agreement (SLA).

The SLA will define how availability is measured (for example, the percentage of queries handled successfully in less than 50ms during the last quarter), and the minimum acceptable value for it (the Service Level Objective, or SLO). Note that this is a product decision, not a technical one.

This number is very important for an SRE team and its relationship with the developers. It is not taken lightly, and it should be measured and enforced rigorously (more on that later).

Only a few things on earth really require 100% availability (pacemakers, for example), and achieving really high availability is very costly. Most of us are dealing with more mundane affairs, and in the case of websites, there are many other things that fail pretty often: network glitches, OS freezes, browsers being slow, etc.

So an SLO of 100% is almost never a good idea, and in most cases it is impossible to reach. In places like Google an SLO of "five nines" (99.999%) is not uncommon, and this means that the service can't fail completely for more than about 5 minutes across a whole year!
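
To put those numbers in perspective, here is a quick back-of-the-envelope calculation (just a sketch, not any particular tool) of how much total downtime different SLOs allow over a year:

```python
# Back-of-the-envelope: total downtime allowed per year for a given SLO.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for slo in (0.99, 0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - slo) * MINUTES_PER_YEAR
    print(f"SLO {slo * 100:.3f}%: about {allowed_downtime:.0f} minutes of downtime per year")
```

The "five nines" line is where the roughly 5 minutes per year figure comes from; each extra nine cuts the allowance by a factor of ten.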

Measure and report performance against SLA/SLO

Once you have a defined SLA and SLO, it is very important that these are monitored accurately and reported constantly. If you wait for the end of the quarter to produce a hand-made report, the SLA is useless, as you only know you broke it when it is too late.

You need automated and accurate monitoring of your service level, and this means that the SLA has to be concrete and actionable. Fuzzy requirements that can't be measured are just a waste of time.
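
As a minimal illustration of what "concrete and actionable" could look like, here is a toy calculation of measured availability from per-request data. The tuple format, the success criterion, and the 50ms threshold are assumptions made up for this example, not any real monitoring API:

```python
# Toy availability calculation: a request counts as "good" if it succeeded
# and was answered within the latency threshold from the SLA example above.
LATENCY_THRESHOLD_MS = 50

def measured_availability(requests):
    """requests: iterable of (status_code, latency_ms) tuples."""
    total = good = 0
    for status, latency_ms in requests:
        total += 1
        if status < 500 and latency_ms <= LATENCY_THRESHOLD_MS:
            good += 1
    return good / total if total else 1.0

# Example: 9,990 fast successes and 10 failures -> 99.9% measured availability.
sample = [(200, 12)] * 9990 + [(500, 12)] * 10
print(f"{measured_availability(sample):.3%}")
```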

This is a very important tool for SRE, as it allows them to see the progression of the service over time, detect capacity issues before they become outages, and at the same time show how much downtime can still be taken without breaking the SLA. Which brings us to one core aspect of SRE:

Use error budgets and gate launches on them

If the SLO is the minimum rate of availability, then 1 - SLO is the fraction of the time a service can fail without falling out of the SLA. This is called the error budget, and you get to use it the way you want.

If the service is flaky (e.g. it consistently fails 1 out of every 10,000 requests), most of that budget is just wasted and you won't have any margin for launching riskier changes.

On the other hand, a stable service that does not eat the budget away gives you the chance to bet part of it on releasing more often, and getting new features to the user quicker.

The moment the error budget is spent, no more launches are allowed until the average goes back out of the red.
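
A minimal sketch of what such a gate could look like, assuming the measured availability comes from the kind of monitoring described above (this is illustrative, not Google's actual tooling):

```python
def launches_allowed(slo, measured_availability):
    """Gate launches on the error budget: allow them only while the
    measured failure rate stays below what the SLO permits."""
    error_budget = 1 - slo                    # e.g. 0.001 for a 99.9% SLO
    budget_spent = 1 - measured_availability  # fraction of requests that failed
    return budget_spent < error_budget

# A stable quarter leaves room to launch; a bad one freezes launches.
print(launches_allowed(slo=0.999, measured_availability=0.9995))  # True
print(launches_allowed(slo=0.999, measured_availability=0.9981))  # False
```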

Once everyone can see how the service is performing against this agreed contract, many of the traditional sources of conflict between development and operations just disappear.

If the service is working as intended, then SRE does not need to interfere with new feature launches: SRE trusts the developers' judgement. Instead of stopping a launch because it seems risky or under-tested, there are hard numbers that make the decision for you.

Traditionally, Devs get frustrated when they want to release, but Ops won't accept it. Ops thinks there will be problems, but it is difficult to back this feeling with hard data. This fuels resentment and distrust, and management is never pleased. Using error budgets based on already established SLAs means there is nobody to get upset at: SRE does not need to play bad cop, and SWE is free to innovate as much as they want, as long as things don't break.

At the same time, this provides a strong incentive for developers to avoid risking their budget in poorly-prepared launches, to perform staged deployments, and to make sure the error budget is not wasted by recurrent issues.


That's all for today. The next article will continue delving into how the traditional tensions between Devs and Ops play out in the SRE world.

Tales from the SRE trenches: SREs are not firefighters

This is the third part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.

If you haven't already, you might want to read part 1 and part 2 first.

In this post, I talk about some strategies to avoid drowning SRE people in operational work.


SREs are not firefighters

As I said before, it is very important that there is trust and respect between Dev and Ops. If Ops is seen as just an army of firefighters who will gladly put out fires at any time of the night, there are fewer incentives to write good software. Conversely, if Dev is shielded from the realities of the production environment, Ops will tend to think of them as delusional and untrustworthy.

Common staffing pool for SRE and SWE

The first tool to combat this at Google -and possibly a difficult one to implement in small organisations- is to have a single headcount budget for Dev and Ops. That means that the more SREs you need to support your service, the fewer developers you have to write it.

Combined with this, Google gives SREs the possibility to move freely between teams, or even to transfer out to SWE. Because of this, a service that is painful to support will see its most senior SREs leaving and will only be able to attract less experienced hires.

All this is a strong incentive to write good quality software, and to work closely with and listen to the Ops people.

Share 5% of operational work with the SWE team

On the other hand, it is necessary that developers see first hand how the service works in production, understand the problems, and share the pain of things failing.

To this end, SWEs are expected to take on a small fraction of the operational work from SRE: handling tickets, being on-call, performing deployments, or managing capacity. This results in better communication among the teams and a common understanding of priorities.

Cap operational load at 50%

One very uncommon rule of SRE at Google is that SREs are not supposed to spend more than half of their time on "operational work".

SREs are supposed to be spending their time on automation, monitoring, forecasting growth... Not on repeatedly fixing by hand issues that stem from bad systems.

Excess operational work overflows to SWE

If an SRE team is found to be spending too much time on operational work, that extra load is automatically reassigned to the development team, so SRE can keep doing their non-operational duties.

In extreme cases, a service might be deemed too unstable to maintain, and SRE support is completely removed: it means the development team now has to carry pagers and do all the operational work themselves. It is a nuclear option, but the threat of it happening is a strong incentive to keep things sane.


The next post will be less about how to avoid unwanted work and more about the things that SRE actually do, and how these make things better.