Tales from the SRE trenches: What can SRE do for you?

This is the fourth part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.

Previous articles can be found here: part 1, part 2, and part 3.

As a side note, I am writing this from Heathrow Airport, about to board my plane to EZE. Buenos Aires, here I come!

So, what does SRE actually do?

Enough about how to keep the holy wars under control and how to get work away from Ops. Let's talk about some of the things that SRE does.

Obviously, SRE runs your service, performing all the traditional SysAdmin duties needed for your cat pictures to reach their consumers: provisioning, configuration, resource allocation, etc.

They use specialised tools to monitor the service, and get alerted as soon as a problem is detected. They are also the ones waking up to fix your bugs in the middle of the night.

But that is not the whole story: reliability is not only impacted by new launches. Suddenly usage grows, hardware fails, networks lose packets, solar flares flip your bits... When working at scale, things are going to break more often than not.

You want these breakages to affect your availability as little as possible. There are three strategies you can apply to this end: minimise the impact of each failure, recover quickly, and avoid repetition.

And of course, the best strategy of them all: preventing outages from happening at all.

Minimising the impact of failures

A single failure that takes the whole service down will severely affect your availability, so the first strategy as an SRE is to make sure your service is fragmented and distributed across what are called "failure domains". If the data center catches fire, you are going to have a bad day, but it is a lot better if only a fraction of your users depend on that one data center, while the others keep happily browsing cat pictures.

SREs spend a lot of their time planning and deploying systems that span the globe to maximise redundancy while keeping latencies at reasonable levels.
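To make the failure-domain idea a bit more concrete, here is a minimal sketch of how much damage a single data center fire does when users are spread across several domains. Everything in it is an assumption for illustration: the domain names, the hash-based assignment, and the simplification that each user depends on exactly one domain (a real system would usually fail users over to another one).

```python
import hashlib
from collections import Counter
from typing import Sequence

# Hypothetical failure domains; the names are made up for this example.
DOMAINS = ["dc-a", "dc-b", "dc-c", "dc-d"]


def home_domain(user_id: str, domains: Sequence[str]) -> str:
    """Assign a user to one domain using a stable hash of the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return domains[int.from_bytes(digest[:8], "big") % len(domains)]


def affected_fraction(user_ids: Sequence[str], domains: Sequence[str], failed: str) -> float:
    """Fraction of users whose home domain is the one that failed."""
    counts = Counter(home_domain(u, domains) for u in user_ids)
    return counts[failed] / len(user_ids)


users = [f"user-{i}" for i in range(10000)]
fraction = affected_fraction(users, DOMAINS, "dc-a")
print(f"Users affected if dc-a burns down: {fraction:.1%}")
```

With four domains of similar size, losing one of them should hit roughly a quarter of the users instead of all of them.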

Recovering quickly

Many times, retrying after a failure is actually the best option. So another strategy is to automatically restart, within milliseconds, any piece of your system that fails. This way, fewer users are affected, while a human has time to investigate and fix the real problem.
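As a rough illustration of the restart idea, here is a toy supervisor that re-runs a task whenever it crashes and gives up after a fixed number of restarts. The names `supervise` and `make_flaky_task`, the restart limit and the delay are all hypothetical; real systems delegate this to a process supervisor or cluster scheduler rather than a loop like this one.

```python
import time
from typing import Callable, Tuple


def supervise(task: Callable[[], str], max_restarts: int = 5,
              delay_s: float = 0.01) -> Tuple[str, int]:
    """Run task, restarting it after each failure.

    Returns the task result and the number of restarts that were needed.
    Raises RuntimeError once max_restarts is exhausted, so a human gets involved.
    """
    restarts = 0
    while True:
        try:
            return task(), restarts
        except Exception as exc:
            if restarts >= max_restarts:
                raise RuntimeError(f"gave up after {restarts} restarts") from exc
            restarts += 1
            time.sleep(delay_s)  # short pause before the automatic restart


def make_flaky_task(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("transient failure")
        return "cat picture served"

    return task


result, restarts = supervise(make_flaky_task(2))
print(result, "after", restarts, "restarts")
```

The important design choice is the limit: automatic restarts hide transient failures from users, but a piece that keeps crashing should eventually escalate to a person.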

If a human needs to intervene, it is crucial that they get notified as fast as possible, and that they have quick access to all the information needed to solve the problem: detailed documentation of the production environment, meaningful and extensive monitoring, play-books[^playbook], etc. After all, at 2 AM you probably don't remember in which country the database shard lives, or what the series of commands was to redirect all the traffic to a different location.

[^playbook]: A play-book is a document containing critical information needed when dealing with a specific alert: possible causes, hints for troubleshooting, links to more documentation. Some monitoring systems will automatically add a play-book URL to every alert sent, and you should be doing it too.

Implementing monitoring and automated alerts, writing documentation and play-books, and practising for disasters are other areas to which SREs devote a lot of effort.
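Attaching a play-book link to every alert can be as simple as deriving the URL from the alert name. The following is only a sketch under that assumption: the `Alert` class, the base URL and the one-page-per-alert naming scheme are invented for the example and do not describe any particular monitoring system.

```python
from dataclasses import dataclass

# Hypothetical location of the play-book pages, one page per alert name.
PLAYBOOK_BASE_URL = "https://playbooks.example.com"


@dataclass(frozen=True)
class Alert:
    name: str
    summary: str
    playbook_url: str


def build_alert(name: str, summary: str) -> Alert:
    """Create an alert that always carries a link to its play-book."""
    return Alert(name=name, summary=summary,
                 playbook_url=f"{PLAYBOOK_BASE_URL}/{name}")


def format_page(alert: Alert) -> str:
    """Render the text that would be sent to the person on call."""
    return f"[{alert.name}] {alert.summary}\nPlay-book: {alert.playbook_url}"


print(format_page(build_alert("DiskAlmostFull", "db-shard-7 is at 95% capacity")))
```

Because the link is built in the same place as the alert, nobody can create a new alert and forget to point to its documentation.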

Avoiding repetition

Outages happen, pages ring, problems get fixed, but each incident should also be a chance for learning and improving. A page should require a human to think about a problem and find a solution. If the same problem keeps appearing, the human is not needed any more: the problem needs to be fixed at the root.

Another key aspect of dealing with incidents is to write post-mortems[^postmortem]. Every incident should have one, and these should be tools for learning, not finger-pointing. Post-mortems can be an excellent tool, as long as people are honest about their mistakes, the documents are not used to blame other people, and issues are followed up with bug reports and discussions.

[^postmortem]: Like its real-life counterpart, a post-mortem is the process of examining a dead body (the outage), gutting it and trying to understand what caused its demise.

Preventing outages

Of course, nobody can prevent hard drives from failing, but there are certain classes of outages that can be forecast with careful analysis.

Usage spikes can bring a service down, but an SRE team will ensure that the systems are load-tested at higher-than-normal rates. The team can also be prepared to scale the service quickly, provided the monitoring system alerts as soon as a certain threshold is reached.

Monitoring is actually the key part in this. By measuring relevant metrics for all the components of a system, and following their trends over time, SREs can completely avoid many outages. Latencies growing out of acceptable bounds, disks filling up, and progressive degradation of components are all examples of typical problems automatically monitored (and alerted on) in an SRE team.
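Following a trend over time can be sketched with the disk example: fit a straight line through recent usage samples and estimate how long it takes to reach capacity. This is a deliberately naive illustration; the function names, the linear-growth assumption and the 24-hour alerting horizon are mine, and real monitoring systems offer their own prediction functions.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity: float) -> Optional[float]:
    """Estimate the hours left until a disk is full.

    samples holds (hour, used) pairs. A least-squares line is fitted through
    them and extrapolated to capacity. Returns None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


def should_alert(samples: Sequence[Tuple[float, float]], capacity: float,
                 horizon_hours: float = 24.0) -> bool:
    """Alert when the disk is predicted to fill up within the horizon."""
    remaining = hours_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon_hours


# Usage grows by 2 GB per hour on a 100 GB disk.
usage = [(0.0, 50.0), (1.0, 52.0), (2.0, 54.0), (3.0, 56.0)]
print(hours_until_full(usage, capacity=100.0), should_alert(usage, capacity=100.0))
```

Alerting on the predicted time to failure, rather than on a fixed percentage, gives the person on call a warning while there is still time to act.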

This is getting pretty long, so the next post will be the last one in this series, with some extra tips from SRE experience that I am sure can be applied in many places.