This is the third part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.
In this post, I talk about some strategies to avoid drowning SRE people in operational work.
SREs are not firefighters
As I said before, it is very important that there is trust and respect between Dev and Ops. If Ops is seen as just an army of firefighters that will gladly put out fires at any time of the night, there is less incentive to make good software. Conversely, if Dev is shielded from the realities of the production environment, Ops will tend to think of them as delusional and untrustworthy.
Common staffing pool for SRE and SWE
The first tool to combat this at Google (and possibly a difficult one to implement in small organisations) is a single headcount budget shared by Dev and Ops. That means that the more SREs you need to support your service, the fewer developers you have to write it.
Combined with this, Google lets SREs move freely between teams, or even transfer out to SWE. Because of this, a service that is painful to support will see its most senior SREs leave and will only be able to afford less experienced hires.
All this is a strong incentive to write good-quality software, to work closely with the Ops people, and to listen to them.
Share 5% of operational work with the SWE team
On the other hand, developers need to see first-hand how the service works in production, understand the problems, and share the pain of things failing.
To this end, SWEs are expected to take on a small fraction of the operational work from SRE: handling tickets, being on-call, performing deployments, or managing capacity. This results in better communication among the teams and a common understanding of priorities.
Cap operational load at 50%
One very uncommon rule from SRE at Google is that SREs are not supposed to spend more than half of their time on "operational work".
SREs are supposed to be spending their time on automation, monitoring, forecasting growth... not on repeatedly fixing, by hand, issues that stem from bad systems.
Excess operational work overflows to SWE
If an SRE team is found to be spending too much time on operational work, that extra load is automatically reassigned to the development team, so SRE can keep doing their non-operational duties.
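To make the cap and the overflow rule concrete, here is a minimal sketch of the arithmetic. It is only an illustration: the post does not say how time is measured or how the overflow is worked out in practice, so the weekly hour counts, the `TeamWeek` type and the `split_operational_load` function are all hypothetical names and assumptions of mine.

```python
from dataclasses import dataclass

# Cap from the text: at most half of SRE time goes to operational work.
OPS_CAP = 0.50


@dataclass
class TeamWeek:
    """Hours logged by a hypothetical SRE team in one week (assumed unit)."""
    total_hours: float        # all hours worked by the SRE team
    operational_hours: float  # tickets, on-call, deployments, manual fixes


def split_operational_load(week: TeamWeek, cap: float = OPS_CAP) -> tuple[float, float]:
    """Return (hours kept by SRE, hours overflowing to the SWE team)."""
    allowed = week.total_hours * cap
    kept = min(week.operational_hours, allowed)
    overflow = max(week.operational_hours - allowed, 0.0)
    return kept, overflow


# Example: 200 team hours, 130 of them operational, so 30 are over the cap.
week = TeamWeek(total_hours=200.0, operational_hours=130.0)
kept, overflow = split_operational_load(week)
print(f"SRE keeps {kept:.0f}h, SWE takes {overflow:.0f}h")
```

With these made-up numbers the SRE team keeps 100 hours of operational work and the remaining 30 go to the developers. A team that stays under the cap gets an overflow of zero.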
In extreme cases, a service might be deemed too unstable to maintain, and SRE support is removed completely: the development team now has to carry pagers and do all the operational work themselves. It is a nuclear option, but the threat of it happening is a strong incentive to keep things sane.
The next post will be less about how to avoid unwanted work and more about the things that SREs actually do, and how these make things better.