InfoQ Homepage Articles How Platform and Site Reliability Engineering Are Evolving DevOps

How Platform and Site Reliability Engineering Are Evolving DevOps

Feb 06, 2024 8 min read

InfoQ Article Contest

Share your knowledge Win a ticket to a QCon event
or an InfoQ Dev SummitFind out more

Key Takeaways

DevOps, platform engineering, and SREs all need to be proficient in coding. They must stay up-to-date with new developments in their organization and among the cloud and other providers they work with.
While DevOps engineers don't need to focus on systems engineering, platform engineers and SREs do. SREs need to manage offerings at scale across multiple cloud providers, and platform engineers must understand systems in the context of the tools developers use.
Crisis management is a key skill for platform engineers and SREs, with SREs needing to efficiently fix and debug issues during multiple failures of a business-critical application.
Communication skills are essential for all engineers. This is especially true for SREs and platform engineers, as they interact with various groups with different risk tolerances and stakes.
Platform engineers need product management skills or a dedicated Product Manager, as they are responsible for building an internal base for development teams and must understand changing requirements and use cases over time

The DevOps model has morphed from "nice to have" to "must have" for any company that needs to move quickly from idea to production to product in users’ hands. In fact, DevOps is table stakes. Companies are now looking to grow and more effectively manage DevOps with platform engineering and SRE (site reliability engineering) staff.

No one has these roles perfectly carved out right now - there’s just too much to do and not enough people to do it - but knowing where these three disciplines do and don’t overlap will help organizations evolve and take advantage when they are ready.

You might say that platform engineering is the next big thing in the DevOps world. Platform engineering is not brand new; it’s an emerging model focused on not only simplifying developers’ lives, but also empowering them with a set of carefully curated gold-standard tools.

The goal of site reliability engineering, meanwhile, is to use software and automation to solve problems and manage production systems.

In fact, SREs ideally split their time between operations tasks and development work, such as creating tools, scaling systems, and implementing automation.

The Virtuous (and Variable) Venn Diagram

It’s helpful to think of DevOps, platform engineering and site reliability engineering roles in the context of a Venn diagram.

In the Venn diagram pictured below, I have included the granular skills required in, between, and among these three circles.

There are five macro categories to help understand these roles.

1. Development

DevOps teams, platform engineers, and SREs all need to know how to write code, leveraging cloud-native tools such as containers and Kubernetes. But even if they are all experts in the languages, practices, and models currently in use at their organizations, they have to be ready for - and open-minded about - whatever comes next.

2. Systems Engineering

The typical DevOps engineer does not have to worry about systems engineering. The typical platform engineer and site reliability engineer do - but not to the same extent.

3. Crisis Management

This is a skill set that might not come immediately to mind, but the ability to effectively manage a crisis is another key capability for both platform engineers and SREs, especially SREs. Folks on the DevOps team might feel differently, but crisis management is not a key area of concern for them.

4. General Communications Skills

It may seem obvious to say this (at least, I hope it is), but communication skills matter for every engineer. For SREs and platform engineers, such skills are an absolute necessity because both groups interact with many different groups of people with many different tolerances for risk. For DevOps to work, there needs to be communication between, well, dev and ops. In general, the better the communications, the more effectively ideas move from theory to design and beyond.

5. Product Management Skills

DevOps teams don’t really need to worry about product management skills, nor do SREs. Platform engineers, however, are all about the platform and, therefore, must be all about managing the platform.

Pick Role-ing

As you’re reading this, you may be thinking: I don’t have those three roles in the company, or I have people performing these functions, but not broken out in the way described here.

That’s fine, and more common than not. DevOps, platform engineering and site reliability engineering comprise a continuum, not a hierarchy, and the PE and SRE roles evolve from a strong DevOps foundation.

Smaller companies, for example, may not have enough staff to fill all of these roles, so people end up wearing multiple hats. But as companies mature and grow, it’s important to zoom out and identify when it makes sense to dedicate specific resources. Do you have multiple - like, 20 to 30 - development teams? Platform engineers can identify commonalities across all of them to determine which tools make sense and what tasks can be automated. Are you scaling and scaling fast? You likely need a site reliability engineering function to keep up.

Having said that, it’s important to note the key phrase "dedicate specific resources." All too often, organizations realize the need for a platform or site reliability engineering and just give people new titles or send them to classes without any real-world context or support.
Neither of these actions alone - or even combined - will be enough.

Building a true platform or site reliability engineering practice requires a purposeful and positive culture shift, with buy-in from top to bottom (and from bottom to top).

Failure Is an Option

As part of this culture shift, it should be OK to fail. Actually, failure should not just be OK but welcome. Most organizations are averse to failure, but it’s only through our failures in these spaces that we can learn and grow and figure out how to best position, leverage, and continue to imagine the roles of DevOps, platform engineers, and SRE.

I’ve seen this play out in large companies that went all in on DevOps and then realized that they needed a team focused on breaking down any barriers that presented themselves to developers.

At scale, DevOps - even with the tools provided by the internally focused platform engineering team - didn’t really cut it. These companies then integrated the SRE function, which filled DevOps’ reliability and scalability gaps.

That worked until these companies realized that they were reinventing the wheel - dozens of times. Different engineering teams within the organization were doing things just differently enough - different setups, different processes, different expectations - that they needed separate setups to put out a service. The SREs were seeing all of this after the fact, which led them to circle back to the realization that different teams needed to be using the same development building blocks. (Back to you, platform engineers.)

Frustrating? Yes. The cost of increasing efficiency in the future? Absolutely.

Indeed, it’s really important that there’s no finger-pointing or wallowing in regret. DevOps doesn’t work at scale? Fine, let’s bring in SRE to help us manage our expanding application portfolio and customer base. It’s a good problem to have! Ten different engineering organizations are doing things in ten different ways? OK, let’s get platform engineering on the job.

No matter how things ebb and flow - and they will ebb and flow - assume and communicate positive intentions. After all, these are bits and bytes we’re dealing with - if the bits and bytes don’t work a certain way, we acknowledge it, rearrange them, and move on.

What will make the DevOps, platform engineering, and SRE functions successful - as they exist on their own and in all of the areas in which they overlap - is the ability to learn from failure and to adapt and change quickly. Disagreeing then committing - as opposed to succumbing to analysis paralysis - allows organizations to keep up a rapid development pace while ensuring that systems are reliable and that the right tools, processes, and people are in place.

Your Next Great DevOps, Platform, and Site Reliability Engineers

Speaking of people, it’s vital to give people opportunities to grow and learn, no matter what their role. Just as you can’t make someone, say, a site reliability engineer just by giving them the title, you can’t discount someone as an SRE just because they don’t have the title (or even the specific training). This goes for DevOps and platform engineering roles, as well.

The people who are the best fit in any of these roles are curious, have a deep understanding of technology, and have demonstrated the ability to effectively collaborate with team members. Look for people who want and are willing to learn. Aligning existing and potential skill sets and mindsets with the key characteristics noted in our Venn diagram can help identify - and retain - your next DevOps, platform, and site reliability engineers.

To be honest, I should have made the Venn diagram graphic interactive, for two reasons: One, the way the functions are distributed among the roles will not be the same for every company. For startups or small companies, for example, the three circles may completely overlap each other. Two, even in a large enterprise where there are very discrete business units and technology functions, things change. The economy changes. Technology changes. People change. And, so, too, will the way the functions fall into place.

There are many factors that will figure into the questions companies should ask themselves to determine which roles make sense and when. But it all boils down to taking stock of applications and what teams are working on. This will enable organizations to proactively determine where roles start to overlap and when breaking out discrete PE and SRE functions from DevOps makes sense.

Conclusion

As the world became digitized, the demand for new apps and services rapidly increased. The DevOps model enabled organizations to meet that demand by accelerating the process of moving ideas from development to deployment. Indeed, in the last decade or so, the DevOps model has delivered increased business value and responsiveness through rapid, high-quality service delivery.

The disciplines of platform engineering and site reliability engineering have emerged to meet the challenge of improving the DevOps process. Dedicating staff to each of these roles may be out of reach today, but building awareness of how the roles complement each other and identifying required skill sets from among existing staff will position organizations to take advantage of the next big things in DevOps.

About the Author

Narayanan Raghavan

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

How Platform and Site Reliability Engineering Are Evolving DevOps

InfoQ Article Contest

Key Takeaways

Related Sponsored Content

The Virtuous (and Variable) Venn Diagram

Pick Role-ing

Failure Is an Option

Your Next Great DevOps, Platform, and Site Reliability Engineers

Conclusion

About the Author

Narayanan Raghavan

Rate this Article

This content is in the Culture & Methods topic

Related Topics:

Related Editorial

Popular across InfoQ

The InfoQ Newsletter