Many people reading this have already suffered me talking to them about Prometheus, in personal conversation or in the talks I gave at DebConf15 in Heidelberg, the Debian SunCamp in Lloret de Mar, BRMlab in Prague, and even at a talk on a different topic at the RABS in Cluj-Napoca.
Since their public announcement, I have been trying to support the project in the ways I could: by packaging it for Debian, and by spreading the word.
Last week the first ever Prometheus conference took place, so this time I did the opposite and spoke about Debian to Prometheus people: I gave a 5-minute lightning talk on Debian support for Prometheus.
What blew me away was the response I got: after this tiny non-talk that I prepared in 30 minutes, many people stopped me later to thank me for packaging Prometheus, and for Debian in general. They told me they were using my packages, gave me feedback, and even some RFPs!
At the post-conference beers, I had quite a few interesting discussions about Debian, release processes, library versioning, and culture clashes with upstream. I was expecting some Debian-bashing (we are old-fashioned, slow, etc.), but instead I had intelligent and enriching conversations.
To me, this reinforces once again my support for and commitment to community conferences, where nobody is a VIP and everybody has something to share. It also taught me the value of bringing different groups together, even when there seems to be little in common.
Many people might not be aware of it, but for a couple of years now we have had an excellent tool for tracking and recognising contributors to the Debian Project: Debian Contributors.
Debian is a big project, and there are many people working on it who do not have great visibility, especially if they are not DDs or DMs. We are all volunteers, so it is very important that everybody gets credited for their work. No matter how small or unimportant they might think their work is, we need to recognise it!
One great feature of the system is that anybody can sign up to provide a new data source. If you have a way to create a list of the people who are helping in your project, you can give them credit!
If you open the Contributors main page, you will get a list of all the groups with recent activity, and the people credited for their work. The data sources page gives information about each data source and who administers it.
For example, my Contributors page shows the many ways in which the system recognises me, all the way back to 2004! That includes commits to different projects, bug reports, and package uploads.
I have been maintaining a few of the data sources that track commits to Git and Subversion repositories:
- The Go packaging group (added just a couple of weeks ago).
- The Perl packaging group.
The last two are a bit problematic, as they group together all commits to the respective VCS repositories without distinguishing to which sub-projects the contributions were made.
The Go and Perl groups' contributions are already extracted from that big pile of data, but it would be much nicer if each substantial packaging team had their own data source. Sadly, my time is limited, so this is where you come into the picture!
If you are a member of a team and want to help with this effort, adopt a new data source. You could provide commit logs, but it is not limited to that; think of translators, event volunteers, BSP attendees, etc.
Do you fancy a hack-camp in a place like this?
As you might have heard by now, Ana (Guerrero) and I are organising a small Debian event this spring: the Debian SunCamp 2016.
It is going to be different to most other Debian events. Instead of a busy schedule of talks, SunCamp will focus on the hacking and socialising aspect.
We have tried to make this event the simplest event possible, both for organisers and attendees. There will be no schedule, except for the meal times at the hotel. But these can be ignored too: there is a lovely bar that serves snacks all day long, and plenty of restaurants and cafés around the village.
One of the things that makes the event simple is that we have negotiated a flat price for accommodation that includes usage of all the facilities in the hotel, and optionally food. We will give you a booking code, and then you arrange your accommodation as you please; you can even stay longer if you feel like it!
The rooms are simple but pretty, and everything has been renovated very recently.
We are not preparing a talks programme, but we will provide the space and resources for talks if you feel inclined to prepare one.
You will have a huge meeting room, divided into 4 areas to reduce noise, where you can hack, have team discussions, or present talks.
Of course, some people will prefer to move their discussions to the outdoor area.
Or just skip the discussion, and have a good time with your Debian friends, playing pétanque, pool, air hockey, arcades, or board games.
Do you want to see more pictures? Check the full gallery
Debian SunCamp 2016
Hotel Anabel, Lloret de Mar, Province of Girona, Catalonia, Spain
May 26-29, 2016
Tempted already? Head to the wikipage and register now, it is only 7 weeks away!
Please reserve your room before the end of April. The hotel has reserved a number of rooms for us until that time. You can reserve a room after April, but we can't guarantee the hotel will still have free rooms.
If you take the Sarfa bus, they look like this:
If arriving at BCN terminal 2 (most low-cost carriers do), there are coach parking bays between the B and C buildings. At the far end in the picture below, you can see the buses parked on both sides of the street. Sarfa is going to be picking up people from the right side (the same side as the terminal building B).
You have to get off at the "Lloret de Mar" bus station.
From there, you walk less than ten minutes, until you reach the hotel. There is a street that goes below it, and the entrance is to the left:
When you enter, the lobby is very bright, with views of the garden and swimming pool.
The rooms are simple, but comfortable. They have been recently renovated.
There are many amenities in the hotel, especially in the courtyard, around the two outdoor swimming pools (there is also an indoor swimming pool in the spa area).
There are plenty of lounge chairs for sunbathing, or just chilling by the pool.
There is also a café/snack bar, with an outdoor terrace which has a moveable roof and heating (in case of need).
You can enjoy playing pétanque, pool, air hockey, some arcades, and even board games.
The hotel seems well prepared for wheelchair access. Every stair has a ramp or an elevator next to it.
The room for hacklabs and talks is huge, with three divisions and independent accesses from the garden.
Before high season, when it gets full of tourists, you will be able to still enjoy a quiet village, the great beaches, and some sightseeing.
So I was reading G+, and saw there a post by Bernd Zeimetz about some "marble machine", which turns out to be a very cool device that is programmed to play a single tune, and it is just mesmerising to watch:
So, naturally, I clicked through to see if there was more music made with this machine. It turns out the machine has been in the making for a while, and the first complete song (the one embedded above) was released only a few days ago. It is obviously porn for nerds, and Wired had already posted an article about it.
So instead I found a band with the same name as the machine: Wintergatan, which sounds pretty great. It took me a while to realise the guy who built the machine is one of the members of the band. They even have a page collecting all the videos about the machine.
After a while, and noticing the suggestions from YouTube, I realised that two of the members of Wintergatan were previously in Detektivbyrån, which is another band I love, and about which I wrote a post on this very blog, 7.5 years ago! [1] So the sad news is that Detektivbyrån disbanded; the good news is that this guy keeps making great music, now with insane machines.
I only discovered Detektivbyrån in the first place thanks to an article in the (now sadly defunct) Coilhouse Magazine.
I find this 8-year-long loop that closes unexpectedly during a late-night idle browsing session pretty amusing.
[1] I keep telling my friends that I was a hipster before it was cool to do so...
Dear node.js/node-webkit people, what's the matter with you?
I wanted to try out some stuff that requires node-webkit. So I try to use npm to download, build and install it, like CPAN would do.
But then I see that the nodewebkit package is just a stub that downloads a 37MB file (using HTTP without TLS) containing pre-compiled binaries. Are you guys out of your minds?
This is enough for me to never again get close to node.js and friends. I had already heard some awful stories, but this is just insane.
Update: This was a rage post, and not really saying anything substantial, but I want to quote here the first comment reply from Christian as I think it is much more interesting:
I see a recurring pattern here: developers create something that they think is going to simplify distributing things by quite a bit, because what other people have been doing is way too complicated[tm].
In the initial iteration it usually turns out to have three significant problems:
- a nightmare from a security perspective
- there's no concern for stability, because the people working on it are devs and not people who have certain reliability requirements
- bundling other software is often considered a good idea, because it makes life so much easier[tm]
Given a couple of years they start to catch up where the rest of the world (e.g. GNU/Linux distributions) is - at least to some extent. And then their solution is actually of a similar complexity compared to what is in use by other people, because they slowly realize that what they're trying to do is actually not that simple and that other people who have done things before were not stupid and the complexity is actually necessary... Just give node.js 4 or 5 more years or so, then you won't be so disappointed.
I've seen this pattern over and over again:
- the Ruby ecosystem was a catastrophe for quite some time (just ask e.g. the packagers of Ruby software in distributions 4 or 5 years ago), and only now would I actually consider using things written in Ruby for anything productive
- docker and vagrant were initially not designed with security in mind when it came to downloading images - only in the last one or two years have there actually been solutions available that even do the most basic cryptographic verification
- the entire node.js ecosystem mess you're currently describing
Then the next new fad comes along in the development world and everything starts over again. The only thing that's unclear is what the next hype following this pattern is going to be.
And I also want to quote myself with what I think are some things you could do to improve this situation:
- You want to make sure your build is reproducible (in the most basic sense of being re-buildable from scratch), that you are not building upon scraps of code that nobody knows where they came from, or which version they are. If possible, at the package level don't vendor dependencies, depend on the user having the other dependencies pre-installed. Building should be a mainly automatic task. Automation tools then can take care of that (cpan, pip, npm).
- By doing this you are playing well with distributions, your software becomes available to people that can not live on the bleeding edge, and need to trust the traceability of the source code, stability, patching of security issues, etc.
- If you must download a binary blob, for example what Debian non-free does for Adobe Flashplayer, then for the love of all that is sacred, use TLS and verify checksums!
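To make that last point concrete, here is a minimal sketch in Python of what "use TLS and verify checksums" could look like. The function names, and the idea of shipping the expected SHA-256 digest alongside the installer, are assumptions for illustration; this is not how any particular package manager does it.

```python
import hashlib
import os
import urllib.request


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large blobs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(url, dest, expected_sha256):
    """Download `url` to `dest`, refusing plain HTTP and bad checksums.

    `expected_sha256` is assumed to come from a trusted place, for
    example the (reviewed) source of the installing package itself.
    """
    if not url.startswith("https://"):
        raise ValueError("refusing to download without TLS: " + url)
    # For https URLs, urllib verifies the server certificate by default.
    urllib.request.urlretrieve(url, dest)
    actual = sha256_of_file(dest)
    if actual != expected_sha256.lower():
        os.remove(dest)  # do not leave an unverified blob lying around
        raise ValueError("checksum mismatch: got " + actual)
    return dest
```

The checksum matters even with TLS: the transport only protects the download in transit, while the digest pins the exact file that was reviewed.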
This is the fifth (and last) part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.
Other lessons learned
Apart from the guidelines for SRE I have described in previous posts, there were a few other practices I learned at Google that I believe could benefit most operations teams out there (and developers too!).
It's worth repeating: it is not sustainable to do things manually. When your manual work grows linearly with your usage, either your service is not growing enough, or you will need to hire more people than there are available in the market.
Involve Ops in systems design
Nobody knows better than Ops the challenges of a real production environment; getting early feedback from your operations team can save you a lot of trouble down the road. They might give you insights about scalability, capacity planning, redundancy, and the very real limitations of hardware.
Give Ops time and space to innovate
There is a reason for so much emphasis on limiting the amount of operational load put on SRE: that freed time is not only used for creating monitoring dashboards or writing documentation.
As I said in the beginning, SRE is an engineering team. And when engineers are not running around putting out fires all day long, they get creative. Many interesting innovations come from SREs even when there is no requirement for them.
Write design documents
Design documents are not only useful for writing software. Fleshing out the details of any system on paper is a very useful way to find problems early in the process, communicate to your team exactly what you are building, and convince them why it is a great idea.
Review your code, version your changes
This is something that is ubiquitous at Google, and not exclusive to SRE. Everything is committed into source control: big systems, small scripts, configuration files. It might not seem worth it at the beginning, but having a complete history of all your production environment is invaluable.
And with source control, another rule comes into play: no change is committed without first being reviewed by a peer. It can be frustrating having to find a reviewer and wait for approval for every small change, and sometimes reviews will be suboptimal, but once you get used to it, the benefits greatly outweigh any annoyances.
Before touching, measure
Monitoring is not only for sending alerts; it is an essential tool of SRE. It allows you to gauge your availability, evaluate trends, forecast growth, and perform forensic analysis.
Also very importantly, it should give you the data needed to make decisions based on reality and not on guesses. Before optimising, find where the bottleneck really is; do not decide which database instance to use until you have seen the utilisation of the past few months, etc.
And speaking of that... We really need to have a chat about your current monitoring system. But that's for another day.
I hope I did not bore you too much with these walls of text! Some people told me directly they were enjoying these posts, so soon I will be writing more about related topics. In particular I would like to write about modern monitoring and Prometheus.
I would love to hear your comments, questions, and criticisms.
This is the fourth part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.
As a side note, I am writing this from Heathrow Airport, about to board my plane to EZE. Buenos Aires, here I come!
So, what does SRE actually do?
Enough about how to keep the holy wars under control and how to get work away from Ops. Let's talk about some of the things that SRE does.
Obviously, SRE runs your service, performing all the traditional SysAdmin duties needed for your cat pictures to reach their consumers: provisioning, configuration, resource allocation, etc.
They use specialised tools to monitor the service, and get alerted as soon as a problem is detected. They are also the ones waking up to fix your bugs in the middle of the night.
But that is not the whole story: reliability is not only impacted by new launches. Suddenly usage grows, hardware fails, networks lose packets, solar flares flip your bits... When working at scale, things are going to break more often than not.
You want these breakages to affect your availability as little as possible. There are three strategies you can apply to this end: minimise the impact of each failure, recover quickly, and avoid repetition.
And of course, the best strategy of them all: preventing outages from happening at all.
Minimising the impact of failures
A single failure that takes the whole service down will severely affect your availability, so the first strategy as an SRE is to make sure your service is fragmented and distributed across what are called "failure domains". If the data center catches fire, you are going to have a bad day, but it is a lot better if only a fraction of your users depend on that one data center, while the others keep happily browsing cat pictures.
SREs spend a lot of their time planning and deploying systems that span the globe to maximise redundancy while keeping latencies at reasonable levels.
Many times, retrying after a failure is actually the best option. So another strategy is to automatically restart in milliseconds any piece of your system that fails. This way, fewer users are affected, while a human has time to investigate and fix the real problem.
If a human needs to intervene, it is crucial that they get notified as fast as possible, and that they have quick access to all the information that is needed to solve the problem: detailed documentation of the production environment, meaningful and extensive monitoring, play-books [1], etc. After all, at 2 AM you probably don't remember in which country the database shard lives, or what the series of commands to redirect all the traffic to a different location was.
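The automatic-restart idea can be sketched as a tiny supervisor loop. This is only an illustration in Python under simplifying assumptions: the `supervise` function and its parameters are hypothetical, and real systems delegate this job to a process supervisor or cluster scheduler rather than an in-process loop.

```python
import time


def supervise(task, max_restarts=5, delay_seconds=0.01):
    """Run `task`, restarting it after a failure, up to `max_restarts` times.

    Returns the task's result. If the task still fails after the last
    allowed restart, the error is re-raised so a human can be paged.
    """
    restarts = 0
    while True:
        try:
            return task()
        except Exception as error:
            if restarts >= max_restarts:
                raise  # restarting is not helping: escalate
            restarts += 1
            print(f"task failed ({error!r}); restart {restarts}/{max_restarts}")
            time.sleep(delay_seconds)
```

The cap on restarts is the important design choice here: a component that keeps crashing should stop being papered over and become somebody's problem.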
Implementing monitoring and automated alerts, writing documentation and play-books, and practising for disaster are other areas where SREs devote much effort.
Outages happen, pages ring, problems get fixed, but it should always be a chance for learning and improving. A page should require a human to think about a problem and find a solution. If the same problem keeps appearing, the human is not needed any more: the problem needs to be fixed at the root.
Another key aspect of dealing with incidents is to write post-mortems [2]. Every incident should have one, and these should be tools for learning, not finger-pointing. Post-mortems can be an excellent tool if people are honest about their mistakes, they are not used to blame other people, and issues are followed up by bug reports and discussions.
Of course nobody can prevent hard drives from failing, but there are certain classes of outages that can be forecast with careful analysis.
Usage spikes can bring a service down, but an SRE team will ensure that the systems are load-tested at higher-than-normal rates. They could also be prepared to quickly scale the service, provided the monitoring system will alert as soon as a certain threshold is reached.
Monitoring is actually the key part in this: by measuring relevant metrics for all the components of a system, and following their trends over time, SREs can completely avoid many outages. Latencies growing out of acceptable bounds, disks filling up, and progressive degradation of components are all examples of typical problems automatically monitored (and alerted on) in an SRE team.
This is getting pretty long, so the next post will be the last one of this series, with some extra tips from SRE experience that I am sure can be applied in many places.
[1] A play-book is a document containing critical information needed when dealing with a specific alert: possible causes, hints for troubleshooting, links to more documentation. Some monitoring systems will automatically add a play-book URL to every alert sent, and you should be doing it too.
[2] Similarly to their real-life counterparts, a post-mortem is the process of examining a dead body (the outage), gutting it out and trying to understand what caused its demise.
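As an illustration of following a trend to avoid an outage, the sketch below fits a straight line to disk usage samples and estimates how long until the disk is full. It is written in Python with made-up names (`hours_until_full`), and it assumes roughly linear growth; a real monitoring system would offer its own prediction functions.

```python
def hours_until_full(samples, capacity):
    """Estimate the hours left until usage reaches `capacity`.

    `samples` is a list of (hour, used) pairs, oldest first. A
    least-squares line is fitted through them. Returns None when there
    is not enough data, or when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    full_at = (capacity - intercept) / slope  # time when the line hits capacity
    last_t = samples[-1][0]
    return max(0.0, full_at - last_t)


# Usage: alert while there is still time to act (threshold is an assumption).
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]  # (hour, GB used)
remaining = hours_until_full(usage, capacity=100.0)
if remaining is not None and remaining < 48:
    print(f"disk predicted to fill up in {remaining:.0f} hours")
```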
This is the third part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.
In this post, I talk about some strategies to avoid drowning SRE people in operational work.
SREs are not firefighters
As I said before, it is very important that there is trust and respect between Dev and Ops. If Ops is seen as just an army of firefighters that will gladly put out fires at any time of the night, there are fewer incentives to make good software. Conversely, if Dev is shielded from the realities of the production environment, Ops will tend to think of them as delusional and untrustworthy.
Common staffing pool for SRE and SWE
The first tool to combat this at Google (and possibly a difficult one to implement in small organisations) is to have a single headcount budget for Dev and Ops. That means that the more SREs you need to support your service, the fewer developers you have to write it.
Combined with this, Google offers SREs the possibility of moving freely between teams, or even of transferring out to SWE. Because of this, a service that is painful to support will see the most senior SREs leaving and will only be able to afford less experienced hires.
All this is a strong incentive to write good quality software, to work closely and to listen to the Ops people.
Share 5% of operational work with the SWE team
On the other hand, it is necessary that developers see first hand how the service works in production, understand the problems, and share the pain of things failing.
To this end, SWEs are expected to take on a small fraction of the operational work from SRE: handling tickets, being on-call, performing deployments, or managing capacity. This results in better communication among the teams and a common understanding of priorities.
Cap operational load at 50%
One very uncommon rule from SRE at Google is that SREs are not supposed to spend more than half of their time on "operational work".
SREs are supposed to be spending their time on automation, monitoring, forecasting growth... not on repeatedly and manually fixing issues that stem from bad systems.
Excess operational work overflows to SWE
If an SRE team is found to be spending too much time on operational work, that extra load is automatically reassigned to the development team, so SRE can keep doing their non-operational duties.
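The arithmetic behind the cap and the overflow is simple enough to write down. The sketch below is in Python; the function name and the choice of measuring load in hours are assumptions for illustration, not a description of how Google tracks this.

```python
def split_operational_load(ops_hours, total_hours, cap=0.5):
    """Return (hours kept by SRE, hours overflowing to the dev team).

    `cap` is the maximum fraction of the team's time that may go to
    operational work (50% in the rule described above).
    """
    limit = cap * total_hours
    kept = min(ops_hours, limit)
    return kept, ops_hours - kept


# Usage: a 40-hour week with 30 hours of operational work.
kept, overflow = split_operational_load(ops_hours=30, total_hours=40)
print(f"SRE keeps {kept} hours, {overflow} hours go to the developers")
```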
In extreme cases, a service might be deemed too unstable to maintain, and SRE support is completely removed: it means the development team now has to carry pagers and do all the operational work themselves. It is a nuclear option, but the threat of it happening is a strong incentive to keep things sane.
The next post will be less about how to avoid unwanted work and more about the things that SRE actually does, and how these make things better.
This is the second part in a series of articles about SRE, based on the talk I gave at the Romanian Association for Better Software.
In the first part, I briefly introduced what SRE is. Today, I present some concrete ways in which SRE tries to make things better, by stopping the war between developers and SysAdmins.
Dev vs Ops: the eternal battle
So, it starts by looking at the problem: how to increase the reliability of the service? It turns out that some of the biggest sources of outages are new launches: a new feature that seemed innocuous somehow managed to bring the whole site down.
Devs want to launch, and Ops want to have a quiet weekend, and this is where the struggle begins. When launches are problematic, bureaucracy is put in place to minimise the risks: launch reviews, checklists, long-lived canaries. This is followed by development teams finding ways of side-stepping those hurdles. Nobody is happy.
One of the key aspects of SRE is to avoid this conflict completely, by changing the incentives, so these pressures between development and operations disappear. At Google, they achieve this with a few different strategies:
Have an SLA for your service
Before SRE can support any service, the level of availability that it must achieve to keep the users and the company happy has to be determined: this is called the Service Level Agreement (SLA).
The SLA will define how availability is measured (for example, percentage of queries handled successfully in less than 50ms during the last quarter), and what is the minimum acceptable value for it (the Service Level Objective, or SLO). Note that this is a product decision, not a technical one.
This number is very important for an SRE team and its relationship with the developers. It is not taken lightly, and it should be measured and enforced rigorously (more on that later).
Only a few things on earth really require 100% availability (pacemakers, for example), and achieving really high availability is very costly. Most of us are dealing with more mundane affairs, and in the case of websites, there are many other things that fail pretty often: network glitches, OS freezes, browsers being slow, etc.
So an SLO of 100% is almost never a good idea, and in most cases it is impossible to reach. In places like Google an SLO of "five nines" (99.999%) is not uncommon, and this means that the service can't fail completely for more than 5 minutes across a whole year!
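The "5 minutes a year" figure is easy to check. Here is a small Python sketch of the calculation; the function name is mine, and it assumes the worst case of complete outages (partial failures would consume the budget more slowly).

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(slo, period_minutes=MINUTES_PER_YEAR):
    """Minutes of total outage allowed in a period for a given SLO.

    `slo` is a fraction, e.g. 0.99999 for "five nines".
    """
    return (1.0 - slo) * period_minutes


for label, slo in [("three nines", 0.999),
                   ("four nines", 0.9999),
                   ("five nines", 0.99999)]:
    minutes = allowed_downtime_minutes(slo)
    print(f"{label}: {minutes:.1f} minutes per year")
```

Each extra nine divides the allowed downtime by ten, which gives a feeling for why the cost of availability grows so quickly.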
Measure and report performance against SLA/SLO
Once you have a defined SLA and SLO, it is very important that these are monitored accurately and reported constantly. If you wait for the end of the quarter to produce a hand-made report, the SLA is useless, as you only know you broke it when it is too late.
You need automated and accurate monitoring of your service level, and this means that the SLA has to be concrete and actionable. Fuzzy requirements that can't be measured are just a waste of time.
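To show what "concrete and actionable" can mean, here is a minimal Python sketch of an availability measure like the example given earlier (queries handled successfully in less than 50ms). The `Request` record and the `availability` function are hypothetical names; in practice this number would come out of your monitoring system, not a script.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # was the query handled successfully?
    latency_ms: float   # how long it took to answer


def availability(requests, latency_limit_ms=50.0):
    """Fraction of requests that succeeded within the latency limit."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no failures
    good = sum(1 for r in requests
               if r.ok and r.latency_ms < latency_limit_ms)
    return good / len(requests)


# Usage: compare the measured value against the SLO.
window = [Request(True, 12.0), Request(True, 80.0),
          Request(False, 5.0), Request(True, 49.9)]
slo = 0.999
measured = availability(window)
print(f"availability {measured:.3f}, SLO met: {measured >= slo}")
```

Note that a slow success counts as a failure here: the definition, not the status code, decides what "available" means.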
This is a very important tool for SRE, as it allows you to see the progression of the service over time, detect capacity issues before they become outages, and at the same time show how much downtime can be taken without breaking the SLA. Which brings us to one core aspect of SRE:
Use error budgets and gate launches on them
If the SLO is the minimum rate of availability, then the result of calculating 1 - SLO is the fraction of the time a service can fail without falling out of the SLA. This is called an error budget, and you get to use it the way you want.
If the service is flaky (e.g. it consistently fails 1 out of every 10,000 requests), most of that budget is just wasted and you won't have any margin for launching riskier changes.
On the other hand, a stable service that does not eat the budget away gives you the chance to bet part of it on releasing more often, and getting your new features quicker to the user.
The moment the error budget is spent, no more launches are allowed until the average goes back out of the red.
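A launch gate based on the error budget can be sketched in a few lines of Python. The function names are hypothetical, and counting the budget in failed requests over a measurement window is an assumption; the same idea works with minutes of downtime.

```python
def error_budget(slo):
    """Fraction of requests allowed to fail: 1 - SLO."""
    return 1.0 - slo


def budget_remaining(slo, total_requests, failed_requests):
    """Failed requests we can still afford in this window (may be negative)."""
    allowed_failures = error_budget(slo) * total_requests
    return allowed_failures - failed_requests


def launch_allowed(slo, total_requests, failed_requests):
    """Launches are only allowed while some error budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# Usage: an SLO of 99.9% over a window of one million requests.
print(launch_allowed(0.999, 1_000_000, 400))   # budget left: launch
print(launch_allowed(0.999, 1_000_000, 1500))  # budget spent: freeze
```

The gate takes no opinion on how risky a particular launch looks; it only reads the numbers, which is exactly the point.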
Once everyone can see how the service is performing against this agreed contract, many of the traditional sources of conflict between development and operations just disappear.
If the service is working as intended, then SRE does not need to interfere with new feature launches: SRE trusts the developers' judgement. Instead of stopping a launch because it seems risky or under-tested, there are hard numbers that make the decisions for you.
Traditionally, Devs get frustrated when they want to release, but Ops won't accept it. Ops thinks there will be problems, but it is difficult to back this feeling with hard data. This fuels resentment and distrust, and management is never pleased. Using error budgets based on already established SLAs means there is nobody to get upset at: SRE does not need to play bad cop, and SWE is free to innovate as much as they want, as long as things don't break.
At the same time, this provides a strong incentive for developers to avoid risking their budget in poorly-prepared launches, to perform staged deployments, and to make sure the error budget is not wasted by recurrent issues.
That's all for today. The next article will continue delving into how traditional tensions between Devs and Ops play out in the SRE world.
This blog is powered by ikiwiki.