Where DevOps and Site Reliability Engineers Intersect and Diverge

While DevOps teams and site reliability engineers (SREs) have both gained prominence in IT circles, the similarities and differences between the two aren’t always well understood. They are closely aligned in the services they provide to their businesses, but there are clear lines of demarcation in the roles they play, the tool sets they use, and the way they are incentivized, both organizationally and internally. Here’s a quick overview:

Where are they focused?
As a rule of thumb, anything pre-production is DevOps, while post-production work is SRE. DevOps is primarily focused on enabling application development and delivery into production; SREs are much more focused on the stability, or reliability, of the platform once it is in production.

What tools do they use?
Given the differences in their goals, their toolkits are also dissimilar. DevOps teams are more focused on IT workflow and automation tools like Jenkins, Chef, Puppet, and Harness. They also rely on cloud engineering and infrastructure-as-code tools such as Ansible, Pulumi, and the HashiCorp tool suite.

SREs are focused more on monitoring, via Datadog, Prometheus, and similar platforms. They are always on call, so PagerDuty or similar tools are critical to them. They must also be familiar with tools for defining service level objectives (SLOs) and service level indicators (SLIs), such as Blameless or Nobl9. Together, these tools give SREs the information they need to identify those indicators, then track and report against them.
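To make the SLO and SLI idea concrete, here is a minimal sketch in Python of how an availability SLI might be compared against an SLO target. It assumes the SLI is simply the fraction of successful requests in a window; the names `SloReport` and `evaluate_slo` are hypothetical and do not correspond to the API of Blameless, Nobl9, or any other tool mentioned here.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # observed fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the error budget already used (1.0 = all of it)
    slo_met: bool           # True when the SLI is at or above the target


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Assumption: 'good' and 'total' counts come from a monitoring system;
    how they are collected is outside the scope of this sketch.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, sli >= slo_target)


if __name__ == "__main__":
    # Hypothetical numbers: 9,990 good requests out of 10,000 against a 99.5% target.
    report = evaluate_slo(good_requests=9_990, total_requests=10_000, slo_target=0.995)
    print(report)
```

In this sketch, a `budget_consumed` value approaching 1.0 would be the signal to report on, since it means the service is close to missing its objective for the window.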

Which is more technically demanding?

In terms of required training and overall technical depth, DevOps is likely to be the more hands-on role, given the need to know how to build a pipeline and maintain it in a way that meets the needs of a broad set of stakeholders.

SREs need stronger software engineering knowledge. Being able to diagnose issues and route them to the right people is critical in their world. While SREs don’t need to know the details of infrastructure provisioning, they do need to know how to determine when latency first appears on a particular piece of cloud infrastructure, and why.
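As an illustration of spotting when latency first appears, the following Python sketch scans time-ordered windows of latency samples and reports the first window whose 95th-percentile latency exceeds a threshold. The function names, the window labels, and the 500 ms threshold are all assumptions made for the example; in practice this kind of check would usually live in a monitoring platform rather than in hand-written code.

```python
from __future__ import annotations

import statistics
from typing import Optional, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Return the 95th-percentile latency of a window (needs at least 2 samples)."""
    # statistics.quantiles with n=100 returns 99 cut points; index 94 is the 95th.
    return statistics.quantiles(samples_ms, n=100)[94]


def first_latency_breach(
    windows: Sequence[tuple[str, Sequence[float]]],
    threshold_ms: float,
) -> Optional[str]:
    """Return the label of the first window whose p95 exceeds the threshold.

    `windows` is assumed to be in chronological order, each entry being
    (label, latency samples in milliseconds). Returns None if no window breaches.
    """
    for label, samples in windows:
        if p95(samples) > threshold_ms:
            return label
    return None


if __name__ == "__main__":
    # Hypothetical samples for one piece of infrastructure.
    windows = [
        ("10:00", [100.0] * 20),
        ("10:05", [100.0] * 18 + [900.0, 950.0]),
        ("10:10", [800.0] * 20),
    ]
    print(first_latency_breach(windows, threshold_ms=500.0))
```

Knowing which window breached first narrows down when the problem started; working out why still requires correlating that time with deployments, traffic, or infrastructure events.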

How did they get there?

When people are just starting their careers, they need to be flexible, and may not have a strong voice when joining a new organization. What they know, and how they can demonstrate it, will determine their roles. Either they are platform-engineering experts who know a lot about how to build cloud platforms, or they know about monitoring usability. If their background is in system administration, DevOps is probably a closer fit: it’s a natural progression from setting up Linux VMs to automating the process. If, on the other hand, bringing order to chaos is their thing, SRE is probably the path to go down.

What are their bad days?

So, what’s a bad day for DevOps or SREs? For an SRE, it’s fire after fire after fire. Especially in large organizations, SREs in many cases are the first line of defense. They’re on call. They’re doing triage. They’re rolling things back, doing whatever they need to do to get the service back up. When everything is on fire and you don’t even know who to escalate to, that’s a bad day for the SRE team.

For DevOps, a bad day is when Jenkins is down, and DevOps pipelines are not working. Someone releases a new change or migration, and then realizes that a critical service in the path hasn’t done the migration yet, and so that team is screaming at DevOps. When engineering teams can’t do their work because of something that DevOps did as part of a migration process, that’s a very bad day.

What are their great days?

The best thing that can happen to an SRE is recognition of pure business value. When someone’s boss says, “Okay, this quarter we saved $5 million in staff hours because we had 70% fewer outages, and 50% of our outages were auto-resolved because of the runbooks that we put in place,” that’s a great day for an SRE.

A great day for DevOps is one of silence. When people are spinning up their infrastructure, deploying things, and everything is working the way it’s supposed to, nobody has a reason to call. When people can do whatever they need to do, pipelines are working, and everything is a streamlined machine that’s chugging along, that’s a great day for a DevOps engineer.

Adding value every day

The last few years have given rise to hundreds, if not thousands, of new roles, terms, acronyms, platforms, and organizations, all pursuing the same goal: excellence and speed in software delivery. The term DevOps, coined more than a decade ago, means something very different today than it did then. Site reliability engineering, a newer but similarly fast-changing role, is gaining in prominence. Regardless of how they intersect and diverge, and how that changes within an individual company (and it does), these two roles sit at the center of the software lifecycle of their organizations, and are only becoming more valuable, and more strategic, as time, and production, move on.

https://www.informationweek.com/software/where-devops-and-site-reliability-engineers-intersect-and-diverge
