With all the hype around platform engineering, there’s a lot of confusion about what it is and, perhaps more importantly, how it differs from more established disciplines like site reliability engineering (SRE) and DevOps. In fact, with many SREs and DevOps professionals moving into platform engineering roles, it’s easy to mistake them for one and the same.
Is platform engineering just a title you can leverage for a salary bump? Is it DevOps rebranded? The differences are more subtle than you think, but they’re also what make platform engineering so important.
Let’s dive in.
“You build it, you run it.” In 2006, this is how Amazon’s CTO Werner Vogels described the company’s approach to software engineering. Amazon’s developers had abandoned the traditional “throw it over the wall” to operations model. Instead, they deployed and ran their applications and services end to end. And so, DevOps was born.
Over time, thought leaders came up with different metrics for organizations to gauge the success of their DevOps setup. The DevOps bible, “Accelerate,” established lead time, deployment frequency, change failure rate and mean time to recovery (MTTR) as standard metrics. Reports like the State of DevOps from Puppet and Humanitec’s DevOps benchmarking study used these metrics to compare top-performing organizations to low-performing organizations and deduce which practices contribute most to their degree of success.
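To make the four metrics concrete, here is a minimal sketch of how a team might compute them from its own deployment records. The `Deployment` record, its field names and the `four_key_metrics` function are hypothetical and chosen for illustration; they are not taken from “Accelerate” or from any of the reports, which define their own measurement methods.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # whether it caused a production failure
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four metrics over a reporting period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    if period_days <= 0:
        raise ValueError("period_days must be positive")

    failures = [d for d in deployments if d.failed]
    # Lead time: commit to production, in hours.
    lead_times_h = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments
    ]
    # Time to recovery: failed deployment to restored service, in hours.
    restore_times_h = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "mean_lead_time_hours": mean(lead_times_h),
        "change_failure_rate": len(failures) / len(deployments),
        # Reported as 0.0 when no failure was recorded in the period.
        "mttr_hours": mean(restore_times_h) if restore_times_h else 0.0,
    }


# Usage with made-up sample data: two deployments in one day, one of which failed.
t0 = datetime(2023, 1, 2, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_at=t0 + timedelta(hours=9)),
]
metrics = four_key_metrics(sample, period_days=1)
```

Averages are used here for simplicity; a team could just as reasonably report medians or percentiles.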
DevOps unlocked new levels of productivity and efficiency for some software engineering teams. But for many organizations, DevOps adoption fell short of their lofty expectations.
Manuel Pais and Matthew Skelton documented the resulting anti-patterns in “DevOps Topologies.” In one scenario, an organization tries to implement true DevOps and removes dedicated operations roles. Developers are now responsible for infrastructure, managing environments, monitoring, etc., in addition to their previous workload. Often senior developers bear the brunt of this shift, either by doing the work themselves or by assisting their junior colleagues.
This so-called “shadow operations” anti-pattern inefficiently allocates organizations’ most expensive and talented resources. This is a prevalent problem among organizations struggling with their DevOps setup. Humanitec’s DevOps benchmarking study found that while 100% of top-performing organizations reported that their developers can do all DevOps tasks on their own, 44% of low-performing organizations have shadow operations.
Shadow operations are indicative of a larger problem with DevOps done wrong: too much cognitive load on developers. Cognitive load is the amount of information a person must process to complete a task. When cognitive load is too high, developers aren’t able to retain and process all of the information they need to complete their work.
Some organizations used DevOps to shift everything onto developers, but many developers didn’t want to do ops or couldn’t juggle the additional responsibilities. Haphazard DevOps adoption increased the volume and complexity of tools and workflows developers had to interact with. Microservice architecture in a cloud native setup often requires knowledge of Kubernetes, configuration management, infrastructure provisioning and more. All of this creates cognitive load and gets in the way of developers completing the most important task: delivering features.
DevOps culture demonstrated that developer self-service can increase productivity and efficiency. At the same time, the cognitive load on developers that resulted for many organizations highlights the need for setups to provide some structure, standardization and the right level of abstraction.
Site Reliability Engineering
SRE was invented and popularized by Google. Like DevOps, it was a cultural shift that got a lot of hype.
According to Benjamin Treynor Sloss, SREs are responsible for the “availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of their service(s).” They use service-level objectives (SLOs) and error budgets to set shared expectations for performance and balance reliability with innovation, respectively.
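As a rough illustration of how an error budget follows from an SLO, the sketch below assumes a request-based availability SLO: the budget is the share of requests that may fail before the objective is missed. The function name `error_budget` and its parameters are hypothetical, and real SRE teams may define budgets over time windows or other indicators instead.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Summarize error-budget usage for a request-based SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or failed_requests < 0:
        raise ValueError("request counts must be positive / non-negative")

    # The budget is everything the SLO does not promise.
    allowed_failures = (1 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Usage with made-up numbers: a 99.9% SLO over one million requests
# leaves room for roughly 1,000 failed requests; 250 have failed so far.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
can_ship_risky_change = budget["remaining_failures"] > 0
```

In this framing, a team with budget left can afford riskier releases, while a team that has spent its budget slows down and prioritizes reliability work.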
There’s nothing wrong with SRE in theory. But site reliability engineering adopted incorrectly can cause a lot of problems, especially for organizations that lack the same resources or talent that Google does.
Hiring SREs is hard and expensive. Many organizations fail to hire experienced-enough SREs to meet the needs of their setup. As a result, some ops people take on these responsibilities. SREs in these organizations are forced to focus more on survival than improving the developer experience. They don’t have time to enable developer self-service or improve architecture or tooling.
This “fake” SRE becomes an incredibly restrictive role, reminiscent of the pre-DevOps approach to software engineering. “DevOps Topologies” summarizes this well: “Devs still throw software that is only ‘feature-complete’ over the wall to SREs. Software operability still suffers because devs are no closer to actually running the software that they build, and the SREs still don’t have time to engage with devs to fix problems when they arise.”
All of this history leads us to platform engineering.
Luca Galante defines platform engineering as the “discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud native era. Platform engineers provide an integrated product most often referred to as an ‘Internal Developer Platform’ covering the operation necessities of the entire life cycle of an application.”
Internal developer platforms, and the teams that build them, are important because they improve their organizations’ performance. The strong relationship between the use of internal developer platforms and degree of DevOps evolution is illustrated in Puppet’s 2020 and 2021 State of DevOps reports.
Humanitec CTO Chris Stephenson describes an IDP as “a bridge between the developers and the platform teams so that developers can do their job without being blocked by operations. On the other side, operations can then ensure standards are applied and things are scalable without having to put that on the developer’s shoulders.”
Good platforms mitigate the problems that arise from poorly adopted DevOps and SRE. Where DevOps creates too much cognitive load for developers, platform engineering seeks to alleviate it by finding the right level of abstraction and paving golden paths. Where fake SRE tends to create bottlenecks for developers, platform engineering prioritizes developer self-service and freedom.
This improvement can be attributed to platform engineering’s product mindset. In his talk for PlatformCon 2022, “Team Topologies” co-author Manuel Pais explained the platform-as-a-product approach. Platforms are like products in that they rely on voluntary adoption, are designed for ease of use and change alongside technology. Therefore, the principles and processes that apply to products should also apply to platforms.
In practice, this means conducting user research, creating a product roadmap, soliciting regular feedback, iterating, launching the platform and marketing it internally to your developers. This process helps platform teams avoid common pitfalls: becoming a glorified help desk, failing to get sufficient internal buy-in for the platform, building tools developers don’t want to use and so on. It also forces platform teams to strike the right balance between developer freedom and abstracting away complexity for their specific organization and its developers.
This approach ensures that your platform solves your developers’ problems in a way that genuinely makes their lives easier. It also increases your awareness of the constraints your organization faces, preventing the haphazard implementation of something like SRE just because it works for Google.
Platform engineering is the next stage of the DevOps evolution. It encompasses the best intentions of the cultural shifts that came before it. Like DevOps, it enables developer self-service. Like SRE, it reduces errors and makes shipping more reliable.
But can’t platform engineering also be done badly? Certainly. That’s why the platform engineering community was created. Through webinars, roundtables and virtual conferences like PlatformCon, this global community has found a way to discover and share best practices for teams at every stage of the platform journey.
Platform engineering is going to be the next big thing. Y’all heard it here first.
Want to get in on the hype? Join over 5000 platform engineers on our community Slack channel.