Why is Site Reliability Engineering Important?

Site reliability engineering (SRE) is one of the fastest-growing enterprise roles and set of operational practices for managing services at scale.

During the Site Reliability Engineering SKILup Day on June 18, I asked a group of DevOps Institute Ambassadors why SRE was important to them.

Below you will find some insightful responses, tips and calls to action. If you missed SRE SKILup Day or wish to rewatch any of the sessions on-demand, visit the conference landing page and access the videos for free: https://devopsinstitute.com/site-reliability-engineering-skilup-day/.

Happy Reading!

Anshul Lalit
DevOps Institute Ambassador from Kristiansand, Norway

“The term SRE surely has been introduced by Google, but directly or indirectly several companies have been doing stuff related to SRE for a long time, though I must say that Google gave it a new direction after coining the term ‘SRE.’ I have a clear view on SRE as I believe it walks hand-in-hand with DevOps. All your infrastructure, operations, monitoring, performance, scalability and reliability factors are accounted for in a nice, lean and automated system (preferably); however this is not enough. Culture is an important aspect driving the SRE aspects, along with business needs. As the norm ‘to each, his own’ goes, SRE is no different. It is easy to get inspired from pioneer companies, but it’s impossible to copy their culture and means to replicate the success, especially with your ‘anti-patterns’ and ‘traditional’ remedial baggage. Do you have similar infrastructure and business needs as the company showcasing brilliant success with SRE? No. Can it help you? Absolutely. The key factor here is to recognize what is important to your success blueprint after understanding the fundamentals of it and find your own success factors considering your cultural needs. Your strategy and culture need to walk together, just like your guiding (strategy) and driving (culture) factors.”

Maciej Jarosz
DevOps Institute Ambassador from Warsaw, Poland

“With all of those methodologies and frameworks out there, one can only ponder: ‘Which one should I choose?’ Well, all of those frameworks and methodologies provide something of value, as long as you have an opportunity to actually test some of those concepts in your own local environment. Consider SRE for example: It worked for Google, sure. If anyone doubts it, there are at least three books on the topic with the materials, proof and philosophy behind SRE. Even if your company is not the scale of Google, you still should take a look at best practices that worked in 2017, 2018, 2019 and that are still working now. Considering the crazy times that we live in right now and the plethora of various free materials, conferences, blog posts, research whitepapers and videos on the topic, what are we waiting for? The knowledge is right here, ripe for the taking. Advancing the humanity of DevOps is just a matter of experimentation, trust and continuous improvement. And with events such as the SKILup Day: Site Reliability Engineering where you can network, exchange field stories, brag about your ‘battle scars’ and participate in some splendid speeches, it’s only easier to jump into the fascinating world of SRE discipline.”

Helen Beal
DevOps Institute Ambassador from Chichester, England

“I might be a bit biased since I’ve been focused on DevOps for nearly a decade as a consultant, coach and learning facilitator, but for me there are a number of systems of thought that contribute to the DevOps body of knowledge. I stole (with permission!) a phrase I heard from DevOps Institute’s CEO, Jayne Groll, many years ago that DevOps is the ‘harmonious, polygamous marriage of ITSM, lean and agile’ and I’ve also presented at several events over the years around the correlations between DevOps and Holacracy. At an event in Munich a few years back, John Willis and I started a conversation about a DevOps Superpattern as he also added his ‘three-legged stool’ of The Theory of Constraints, Safety Culture and Learning Organizations. For me, SRE is another valuable system of thought that we can learn from and use to make practical changes to the way teams and organizations work to drive higher organizational performance.”

Mark Peters
DevOps Institute Ambassador from Texas, USA

“I recently saw a social media image for a shirt which broke out SRE as ‘Simply Restart Everything.’ While not entirely true, the concept of a site reliability engineer is important as another way to link Ops and Dev in the DevOps construct. SREs formalize a position whose job is to connect, to link folks together, and to streamline processes. Increasing reliability means increasing simplicity across multiple workflows. The bigger your organization, the simpler the process needs to be. Unfortunately, many organizations increase complexity, reduce transparency, and create wasted effort as they add more individuals. SREs are important in trying to understand processes from end to end and emphasizing outcomes rather than any particular process stage.”

Sean Davis
DevOps Institute Ambassador from Georgia, USA

“Site reliability engineering bridges the gaps between platform design, development and operational execution by providing new perspectives on system reliability. When you link software engineering and systems engineering mindsets with operational engineering, you streamline the support of a product focused on business outcomes, not ticket output. It can be a valuable gambit for the traditional approach that pays off big for companies. By trading off specializations in focusing on operational burden, you gain a larger view of the platform and reduce the effort required to support the product, and you spend more time automating and innovating to move towards self-healing and autoscaling products instead of manual, reactionary troubleshooting.”

Mike D. Kail
DevOps Institute Ambassador from California, USA

“Instead of the word reliability, I prefer to use and approach it as resiliency, which then includes approaches and contributing members of the chaos engineering group that are hyper-focused on continually increasing the ability to recover from any and all failures, which we all know happen in both software and the underlying hardware. Resiliency also should resonate with the core Development teams as they should strive to continuously improve and collaborate with the SRE teams.”

James Wasson
DevOps Institute Ambassador from Oklahoma, USA

“We create pipelines, version control, and infrastructure as code like breathing now. That means we are done, right? The beauty of SRE is that it takes us right to where the value is made. It requires us to go back to the beginning and ask questions about decisions we need to make regarding a system. Site reliability engineers help us remove ‘resulting’ from our processes. With the help of these principles, we begin to make near-real-time decisions based on the impacts to our customers, rather than the results of the past. We spend so much time creating amazing features, determining the velocity of our teams, and implementing the newest framework from that great book, but one of the most helpful questions we get from site reliability engineering is ‘How fast is too fast?’ If our organizations can answer this question early, we are able to make more informed decisions that create extended and meaningful value to our customers.”

Biswajit Mohapatra
DevOps Institute Ambassador from Pune, India

“The ever-changing digital and cloud landscape brings in an unprecedented need for collaboration, transparency, resiliency, stability, performance, reliability and correctness. SRE sets measurable engineering goals mapping to service level objectives (SLOs) and enables monitoring and tracking of these quality of service (QoS) parameters. SRE is a set of practices focused on reducing silos by shared ownership, planning for failures using error budgets, small-batch changes with focus on stability, automation of manual tasks and introducing a culture of measurement, monitoring and tracking. The fundamental goal of SRE is aimed at depicting a prescriptive approach to plan, build, implement, measure and achieve DevOps objectives with a focus on reliability and automation at every opportunity. SRE covers several critical areas of cloud platform architecture such as orchestrated automation, responsive operation, optimized performance, just-in-time scaling, modernized environment and predictive event management. It’s time to look at operations through the lens of software engineering practices. Leveraging SRE to design, build, operate and enhance software systems is critical to building the culture of risk-managed and reliable digital footprint for enterprises.”

Mitesh Soni
DevOps Institute Ambassador from Mumbai, India

“What vs How
Interface vs Class
Abstract Class vs Class
DevOps vs SRE

Similarity and difference lie in their positions. Ryan Foland has given a wonderful example to describe What and How: ‘What’ begins at the surface level, and usually describes things in a general sense while ‘How’ describes the process of something or the underlying steps and actions that are done to get results. I find this example a perfect fit for the relationship between DevOps and site reliability engineering (SRE). Both DevOps and SRE have created buzz and confusion and will continue to do so. DevOps is cultural change to transform existing practices with people, processes, and tools for a better quality outcome with speed. SRE is a concrete implementation of DevOps practices or principles. I would like to leave one question here. Let me form the base first: Agile and DevOps complement each other. Agile is like a train compartment and DevOps is like an engine. Is SRE like a train track?”

Marc Hornbeek
DevOps Institute Ambassador from Puerto Vallarta, Mexico

“Site reliability engineering (SRE) complements DevOps by measuring and achieving reliability of applications and services working on production and DevOps infrastructure in a prescribed manner using error budgets, team relationships brokered by an error budget, Ops-as-code, and the use of reliability to control deployments. SRE emphasizes the following objectives that are important to DevOps engineering:

Assure availability of applications and services.
Assure error rate and latency meet Service Level Agreements (SLAs).
Enable large-scale systems while controlling risk.
Create operational cost savings by automation and policies.
Improve skills of developers and SREs.
Resolve conflicts between Dev and SRE.
Improve capacity planning and provisioning.”
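
As a rough illustration of how an error budget might be used to control deployments, here is a minimal Python sketch. It assumes a request-based SLO; the function names, the 99.9% target and the request counts are hypothetical and are not taken from any contributor's tooling.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    Assumption: the SLO is expressed as a success ratio, e.g. 0.999.
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def can_deploy(slo_target: float, total_requests: int,
               failed_requests: int) -> bool:
    """Hypothetical release gate: allow deployments only while budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    # Illustrative numbers only: 250 failures out of 1,000,000 requests.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
    print("Deploy allowed:", can_deploy(0.999, 1_000_000, 250))
```

In practice a team would likely feed such a gate from its monitoring system and agree in advance on what happens when the budget is spent; the sketch only shows the arithmetic.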

Suma Puligella
DevOps Institute Ambassador from Virginia, USA

“SRE encompasses both software engineering and IT operations and bridges the gap between them. It ensures that the organization’s systems are scalable, stable, predictable and automated. SRE and DevOps seem like two sides of the same coin with a unified goal of enhancing the release cycle without any compromises. However, there are some differences between SRE and DevOps:

SRE covers architectural and design patterns along with reliability implementation techniques such as reactive programming, whereas DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). DevOps also aims at continuous software delivery with high quality, reliability and shared ownership among software stakeholders.
DevOps raises the risks for the development team(s) to solve, whereas SRE’s approach is to find problems and solve some of them themselves. SRE teams set service-level agreements (SLAs) that define how reliable the systems need to be for end users. Their main goal is to improve performance and operational efficiency.
DevOps is all about ‘what’ needs to be done; SRE talks about ‘how’ it can be done.

The bottom line is that it’s all about the data. You need information in order to understand how to measure success and failure and how to gain continuous reliability across the applications.”

Tiffany Jachja
DevOps Institute Ambassador from Maryland, USA

“Organizations are finding the best practices for delivering faster and meeting customer demands. Site reliability engineering is about building confidence that our systems work as expected. SRE defines the methods, concerns, and responsibilities needed to maintain or increase the reliability of a system. It improves our ability not only to deliver a product to end customers but also to do so in a sustainable way as an organization. Many workloads can be costly, complicated, or time-consuming for different reasons. These practices provide a foundation for the root analysis of our systems. At the end of the day, we want to be able to set key goals, measure systems, and ensure a product or service fulfills its objectives. This is important for anyone with a stake in software development and delivery.”

Uchit Vyas
DevOps Institute Ambassador from Singapore

“Site reliability engineering is a discipline that seeks to create reliable IT systems by managing the rate and quality of change, ensuring clients protect themselves from any break in service and avoid being subject to fines and penalties.

As per my experience, SRE practices are mainly based around the following elements:

SLIs, SLOs and SLAs.
Monitoring.
Toil Definition.
Error Budgets.
Test Automation.
Capacity planning and measuring risk factors.

SRE implements a lot of the same practices as DevOps, but from an operations perspective focuses predominantly on operational stability and resilience of the production estate.”

Zuzana Homolyova
DevOps Institute Ambassador from Slovakia

“SRE works as a kind of connection bridge between development and operations as it cares about every process from the source code to deployment. It really focuses on how to bring scalable and reliable software systems into the real infrastructure, way, application, performance and security. To describe the difference between DevOps and SRE I would say: ‘DevOps asks what needs to be done but SRE asks how it can be done.’

What is really important is that DevOps and SRE do overlap each other. They share goals to understand how to measure success or failure and how to gain continuous reliability across every application.”

Craig Pearson
DevOps Institute Ambassador from Newcastle, England

“I’ve walked the line between DevOps and SRE in organizations and witnessed first-hand both good and bad practices. While there are a lot of SRE success stories, I’d like to focus on the occasional use of SRE as a justification to rebuild the walls around operations/production that DevOps strived for so long to take down. While adopting SLOs and focusing effort on toil reduction are obviously beneficial, taking a defensive position over the production environment is not. The trigger for this is often an issue impacting production (sometimes major, sometimes minor) or the perception that more audit rigor is needed in this new continuous delivery world. Whatever the trigger, sometimes SRE is proposed as the answer (R stands for reliability, right?). SRE has to shift left and spread more of the wisdom of production to teams delivering value-adding products and services to that production environment. Deploying value rapidly is crucial. Doing this in a reliable manner is key. Both are valid reasons for introducing site reliability engineering: the future. However, keeping the wisdom of production in a separate (silo?) SRE team and withholding access to production is a backward step. A touch of back to the future maybe …”

Nadeem Augustine
DevOps Institute Ambassador from Cape Town, South Africa

“Having been exposed to DevOps and SRE, I do see a difference between the two roles. I often find the DevOps way of working focuses on how to bring the issues into development and a way to process them between teams. This creates a collaborative way of working in the organization. That is a pivotal part in rolling out DevOps. DevOps is the ‘what’ that needs to be done in order to bridge the gap between development and operations. Having played both roles as a ‘developer’ and ‘operations,’ you now see how SRE plays a deeper role in assisting the organization in achieving the end goal. As an SRE you focus on the ‘how’ to resolve issues end to end with the quickest, most stable resolution. Having resolved the issues you create a more reliable and repeatable process.”

Daniel Oh
DevOps Institute Ambassador and CNCF Ambassador from Boston, USA

“Who would be the most valuable resource to stabilize the cloud platform built on an immutable infrastructure (i.e. Kubernetes) for the Ops team? Here is the evidence that many recruiters continue to seek the roles that are named site reliability engineers (SRE) but also more IT Ops folks want to have SRE responsibilities in their careers.”

Hugo Lourenço
DevOps Institute Ambassador from Lisbon, Portugal

“How long does it take to find the product you need? How long does it take to get the product delivered? Does the product/service require training or expert assistance? Do you need other calls +, products +, services +, to make this product / service work? Does the product require external maintenance and you will be accountable to remember the provider when it’s needed? To get support assistance or any other contact with the provider consumes your valuable time. Your time? All these reasons were the reasons why I embraced agile, lean and automation = DevOps, DevOps is the only weapon in a digital era. I’m passionate about building products and services where consumers will pay to keep your company to exist and deliver such amazing products/services!”

Felipe Duenas
DevOps Institute Ambassador from Quito, Ecuador

“SRE is more than buzzwords and trends. It is far more about the people and process than the technology that underlies them. While technology certainly can (and likely will) adapt to the concept as it matures and more teams adopt it, you don’t need new tools to align your development and operations organizations around the principles of Site Reliability Engineering. SRE ‘embodies the philosophies of DevOps with a greater focus on measuring and achieving reliability through engineering and operations work.’ SREs also ensure that everyone in the company agrees on how to measure reliability, and what to do when availability falls out of specification. This includes contributors at every level, from developers, through team managers and all the way up to VPs and executives.”

Suresh GP
DevOps Institute Ambassador from Singapore

“Site reliability engineering is redefining the value of infrastructure operations and its relevance in the wake of DevOps toolchain and deployment pipeline. After a long time, we see that this area of discipline is creating a profound impact on the holistic value chain by building world-class resilience and reliability of products and services with an end-to-end view. It also focuses on creating a sense of shared ownership and experimentation in challenging the status quo and delivering streamlined customer experience. I see a lot of synergy with Agile, DevOps, Lean, ITSM with SRE that complements the dimension brilliantly. Last but not the least, cultural change is of phenomenal importance to make this transition happen and be sustainable. The principles and practices are so sound that the legacy environments can still take advantage of it and contribute to velocity increase and improved reliability.”

BMK
DevOps Institute Ambassador and CNCF Ambassador from Wellington, New Zealand

“Enterprises that run or want to run a cloud application at scale need to adapt to SRE practices and disciplines. Merely stating a 99.99% SLA does not help. If you care about your customers and your customers’ experiences with your systems, then yes, we need to level up our game: we need to understand the nitty-gritty of the reliability of our systems and operations and, more importantly, understand the limitations and constraints within which we need to operate. Some useful pointers: 1) Caution: Do not simply rename your ‘system admins’ to ‘SRE’ 2) Upskill your ops folks to code 3) Upskill your dev folks to learn and know more about infrastructure and operations.”
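
To make the 99.99% figure concrete, the short Python sketch below converts an availability target into the downtime it permits. The helper name and the 30-day window are illustrative assumptions; real SLAs define their own measurement windows and exclusions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability: float, period_days: int) -> float:
    """Minutes of downtime an availability target tolerates over a period.

    Assumption: availability is a ratio (0.9999 for 99.99%) measured
    against wall-clock time, with no maintenance exclusions.
    """
    return (1.0 - availability) * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Compare "three nines" and "four nines" over a hypothetical 30-day window.
    for target in (0.999, 0.9999):
        minutes = allowed_downtime_minutes(target, 30)
        print(f"{target:.2%} over 30 days allows about {minutes:.2f} minutes of downtime")
```

Seen this way, a 99.99% target leaves only a few minutes per month, which is why stating the number without understanding the system's constraints does little on its own.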

Aparna Balasundar
DevOps Institute Ambassador, India

“If DevOps is the extension of agile to build continuous delivery pipelines, SRE is a natural extension of DevOps to build resilient, repeatable, reliable, scalable production systems. Shift left to include the SRE as part of the architecture design, build and deploy, favoring long-term viable systems. With the cultural mind shift, if SREs have to provide safe and reliable systems, developers have to dedicate themselves to building such systems. These joint working partnerships will benefit the overall business goal.”

Simone Jo Moore
DevOps Institute Ambassador from Occitanie, France

“Site Reliability Engineering has as many touchpoints across the service value stream as any other practice. I challenge you to map your interactions across your enterprise so the impacts of SRE to the value outcome are clearly visible. Those companies that treat these touchpoints as part of their beating heart competencies versus peripheral consideration will continue to leap forward. The socio aspect is important here. Technology is a catalyst: people’s interaction with the tech has shifted, is shifting and will continue to shift. Security is embedded; there is no point at which practices (human and automated) do not exist in SRE enterprise-wide. The resources we apply depend on how big and complicated the system is.

It goes beyond technical feedback to being an advantageous feedback loop for improvement across business objectives with engagement and financial benefit. Think of SRE as a movie. Yes, I’m a sci-fi fan, but think about movies like ‘Mission to Mars’ or ‘The Martian’ and even ‘Ad Astra.’ In each of these, reliability also includes our trust in something being resilient. Even when things don’t go as expected, there are ways in which the system can be adapted to solve problems. Resiliency is about humans rather than technology, but think about the progression of AI. We want AI to become resilient not by coming back to a baseline but by moving forward to a new baseline of its own accord. Its point of stasis is never the same yet reliable based on the quality of the engineering. Our point of stasis is never the same either, and the cycle continues.”