Site Reliability Engineering

Introduction

As of late 2019, the term Site Reliability Engineering (SRE) is quickly gaining ground in the IT Services and DevOps domains. This might be the first time you are hearing about SRE, so I thought it would be a good idea to write down the basic ideas and concepts. With this article, you will be up to speed on the SRE fundamentals in under five minutes.

What is Site Reliability Engineering?

Site Reliability Engineering is quickly growing to prominence, mainly because it is the operating model Google uses to run its production systems. Google started creating SRE teams around 2003, and the approach became widely known after the 2016 publication of the Site Reliability Engineering book, edited by Beyer, Jones, Petoff and Murphy[1]. The book offers a memorable description of the discipline:

“Site Reliability Engineering is what happens when you ask a software engineer to design an operations team”

Although this is obviously not an official definition, I think it highlights a core focal point of Site Reliability Engineering: it applies (software) engineering best practices to IT operations. If you are familiar with software engineering, you will know that this domain contains many problem-solving techniques, from debugging to root cause analysis. Above all else, software engineering requires a problem-solving attitude and patience.

The approach of integrating Development (Dev) best practices into IT Operations might already sound familiar to you. After all, is this not exactly the concept that we have previously labelled DevOps? A more careful comparison between Site Reliability Engineering and DevOps shows that there are some subtle, yet fundamental differences between the two concepts.

Site Reliability Engineering vs. DevOps

The difference between Site Reliability Engineering and DevOps is best explained by reviewing the origins of both concepts. DevOps, from its early beginnings in 2008, originated as a cultural movement that involved the IT function in each phase of a system’s design and development. Over the years, it has grown to become a set of practices that automates the processes between software development and IT teams, so that they can build, test, and release software faster and more reliably.[2]

More importantly, the concept of DevOps is founded on building a culture of collaboration between teams that historically functioned in relative silos. The emphasis on a collaborative culture is one of the most important characteristics of the DevOps movement. Yet this strength of DevOps, which resulted in rapid adoption all over the planet, might also be its largest weakness. Since DevOps is a cultural movement, some argue that there is no uniform definition of the methodology, and that DevOps is therefore always ‘in flux’.[3] Whether you share this opinion or not, it does explain why the concept of Site Reliability Engineering has gained so much attention.

Site Reliability Engineering embraces most of the fundamental concepts of DevOps, such as automation, collaboration and quality, but augments them with more measurable targets, which are especially relevant in an enterprise context. The Site Reliability Engineering book explains the difference as follows:

“One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.”

So what are these idiosyncratic extensions, and how do they make SRE different? These extensions might be best understood if we take a closer look at the definition of an SRE team at Google. SRE teams at Google are responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). This definition results in a number of core responsibilities that can be efficiently monitored and improved in an enterprise context, which makes SRE a practical approach.

Key characteristics of Site Reliability Engineering

When we break down Google’s definition of a Site Reliability Engineering team, we can identify some core properties of SRE that help us to understand its most fundamental aspects. Unlike the DevOps ‘movement’ with its strong focus on culture, SRE characteristics include practical steps that teams can take to remove any barriers between the world of software engineering and running a secure production environment.

After a review of some of the main publications on Site Reliability Engineering, the following key characteristics emerge as the defining pillars of SRE.

  1. Engineer towards Ops. SRE has a strong focus on using (software) engineering methodologies, such as design and problem-solving practices. Google presents SRE as its alternative to the traditional ‘Sysadmin Approach to Service Management’, which exemplifies this. The strong emphasis on engineering fault-free solutions, with continual feedback loops to other team members, results in the collaborative culture that DevOps describes. A good example to illustrate the ‘Engineer-to-Ops’ principle is Google’s blame-free post-mortem culture, which aims to fix faults by applying engineering, rather than by assigning blame.
  2. End-to-End Reliability. Where traditional Service Management organizations still tend to focus on customer SLAs or (interdepartmental) OLAs, Site Reliability Engineering focuses on another core metric: reliability. Reliability (the ‘R’ in SRE) is measured through Service Level Indicators (SLIs, which are quantifiable measures of service reliability) and Service Level Objectives (SLOs, which set reliability targets for those SLIs). With its strong focus on reliability, SRE aims to find the middle ground between agility (the developers’ side) and stability (the operations’ side). More importantly, reliability is a customer-facing metric, and in today’s 24×7 ‘always-on’ economy, it is often what customers care about most.
  3. Monitoring of IT Assets. The only way in which end-to-end reliability can be achieved is if your organization is always in control. And in order to be in control of your services, you first need to have control over your underlying IT assets. The third key characteristic therefore focuses on the constant identification and monitoring of IT assets. Additionally, subsequent follow-up actions, such as emergency response and escalation management, need to be firmly structured.
  4. ‘Any Change’ Management. Organizations that are already working with Site Reliability Engineering have found that roughly 70% of outages are due to changes in a live system. For most people who have been working in IT services, this will hardly come as a surprise. Therefore, anything that might cause ‘any change’ to a live system should be carefully managed and controlled. Whether this is more accurate demand forecasting, improved capacity planning or more predictable load balancing, SRE focuses on any aspect that might potentially disrupt an IT environment.
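
To make the relationship between SLIs and SLOs from the second characteristic more concrete, here is a minimal Python sketch. It assumes an availability SLI defined as the fraction of successful requests in a measurement window. The class name, function name, request counts and the 99.9% target are all hypothetical and chosen purely for illustration; they do not come from Google or from the SRE book.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLI:
    """Hypothetical request counts for one measurement window."""
    good_requests: int
    total_requests: int

    def value(self) -> float:
        """Fraction of requests served successfully (the SLI)."""
        if self.total_requests == 0:
            # Assumption: a window without traffic counts as fully reliable.
            return 1.0
        return self.good_requests / self.total_requests


def meets_slo(sli: AvailabilitySLI, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    return sli.value() >= slo_target


# Illustrative numbers only: 999,500 successful requests out of 1,000,000.
window = AvailabilitySLI(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {window.value():.4%}")
print("SLO met:", meets_slo(window, slo_target=0.999))
```

The point of the sketch is the separation of concerns: the SLI is a measurement, while the SLO is the target that the measurement is compared against. In practice, a team would typically feed the SLI from its monitoring data rather than from hard-coded counts.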

The four key characteristics above provide an introductory summary that will help you understand the main focus of Site Reliability Engineering. There are many more subtopics that can be explained in detail, and I encourage everyone to review the SRE book, but you will hopefully now have a basic understanding of the most important aspects of SRE.

Learn more about SRE

If you want to learn more about Site Reliability Engineering, Cybiant just launched a 2-day ‘SRE Foundation’ training and certification program.


[1] Beyer, B., Jones, C., Petoff, J. and Murphy, N.R., 2016. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, Inc.

[2] Ebert, C., Gallardo, G., Hernantes, J. and Serrano, N., 2016. DevOps. IEEE Software, 33(3), pp.94-100.

[3] Beyer, B., Murphy, N.R., Rensin, D.K., Kawahara, K. and Thorne, S., 2018. The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media, Inc.
