Managing Container and Kubernetes TCO

Over the past decade, we’ve seen massive growth in cloud Kubernetes adoption as organizations have raced to take advantage of the improved velocity and scalability, and reduced operational complexity, that Kubernetes promises for containerized applications. 

However, Kubernetes can become its own complex beast to tame, and two key total cost of ownership (TCO) issues can arise. First, the time and specialist resources needed to manage (upgrade, optimize and debug) the Kubernetes environment grow over time and can be difficult to satisfy quickly and cost-effectively (through either recruiting or training). Second, far too many operations teams have received the dreaded “surprise bill” driven by unexpected cloud Kubernetes expenses. A recent CNCF survey confirmed this pain, finding significant increases in cloud and Kubernetes-related bills for surveyed organizations, with 68% of respondents reporting that Kubernetes costs increased. Among those, half saw costs jump more than 20% during the year.

These two TCO issues grow exponentially when we start to think about maturing from a single cloud Kubernetes cluster to a multi-cluster design or even the distributed, multi-location cluster management approach needed to address tens, hundreds, and potentially thousands, of distributed Kubernetes clusters. 

To an extent, these increases in TCO are a simple reflection of the increasing prevalence of Kubernetes deployments. As the management of containerized workloads matures, it tends to move from single-cluster to centralized multi-cluster to distributed multi-cluster deployments. There is good reason for this: distributed deployments beat single clusters or centralized cloud in almost every important metric, from performance to reliability, availability, scalability, compliance and more. And of course, as compute resource utilization increases in a distributed environment, so do costs.

So, if distributed Kubernetes is better, what are the costs associated with running a distributed Kubernetes environment and how can we ameliorate those costs?

Distributed Kubernetes Costs

Operational Complexity

First and most fundamental, the operational complexity involved in managing a distributed multi-cluster deployment is exponentially greater than single-cluster or centralized cloud deployments. In most scenarios, this entails operating and managing your own distributed network, along with all the necessary failover planning, operations and telemetry management, endpoint selection, cloud coordination, routing and security management, etc. 

If, for reliability and redundancy, you employ multi-cloud as part of this distributed strategy, the complexity goes up yet again as you need to try to coordinate operational tasks and track costs across vendors. Each might host just a fraction or isolated geographical portion of your overall workload, but it’s still your job to deploy, optimize, track and manage workloads and costs across these disparate nodes and clusters. 

Simple but vital elements such as standardizing deploy pipelines or centralizing operational metrics become significantly more complex in distributed Kubernetes deployments. Managing this complexity adds overhead, and thus cost: in time, staff, tools, services, etc.  

Resource Utilization

Additionally, it can be difficult to efficiently manage resources, endpoints and cluster/container scaling across distributed systems, which can all impact cost. 

Resource management, for example, determines allocation of pods and nodes: how many workloads can you fit on a particular node given the available memory and CPU resources? Say you are paying for a node with 5 GB of memory and 5 CPUs, and your workload requires 2 GB and 2 CPUs. You can fit two pods of that workload per node, but that leaves 1 GB and 1 CPU of the node idle, which is too little to host a third pod. Compounding the issue, Kubernetes workloads don’t always use a fixed amount of memory and CPU; their usage can vary from second to second.
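The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the packing problem, not how the Kubernetes scheduler is implemented; the function names are hypothetical, and the sketch assumes identical pods and ignores any capacity reserved for system components.

```python
def pods_per_node(node_cpu: int, node_mem_gb: int, pod_cpu: int, pod_mem_gb: int) -> int:
    """Number of identical pods that fit on one node.

    The scarcer of the two resources (CPU or memory) sets the ceiling.
    """
    return min(node_cpu // pod_cpu, node_mem_gb // pod_mem_gb)


def idle_fraction(node_amount: int, pod_amount: int, pods: int) -> float:
    """Share of one node resource that is paid for but cannot be used."""
    return (node_amount - pods * pod_amount) / node_amount


# The example from the text: a 5 CPU / 5 GB node, pods needing 2 CPU / 2 GB.
pods = pods_per_node(node_cpu=5, node_mem_gb=5, pod_cpu=2, pod_mem_gb=2)
print(pods)                        # 2
print(idle_fraction(5, 2, pods))   # 0.2, i.e. 20% of the node sits idle
```

Multiplied across many nodes and many locations, that idle fifth is a direct cost with no corresponding benefit.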

Kubernetes provides parameters to manage resources in the form of Requests (the amount of a resource reserved for a workload, which the scheduler uses to decide where a pod can be placed) and Limits (the maximum amount of a resource a workload is allowed to use). Using Requests and Limits inappropriately (or even worse, not setting them at all) can open a path to runaway resource usage and thus cost.
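As a rough illustration of the kind of check a team might run before deploying, the following Python sketch flags containers whose Requests or Limits are missing or inconsistent. It is a hypothetical linting helper, not part of Kubernetes or any Kubernetes client library; the `ContainerResources` type and `audit` function are names invented for this example, with CPU expressed in cores and memory in GB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerResources:
    """Hypothetical, simplified view of one container's resource settings."""
    name: str
    cpu_request: Optional[float] = None     # cores
    cpu_limit: Optional[float] = None       # cores
    mem_request_gb: Optional[float] = None
    mem_limit_gb: Optional[float] = None


def audit(c: ContainerResources) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    pairs = (
        ("cpu", c.cpu_request, c.cpu_limit),
        ("memory", c.mem_request_gb, c.mem_limit_gb),
    )
    for resource, request, limit in pairs:
        if request is None:
            problems.append(f"{c.name}: no {resource} request set")
        if limit is None:
            problems.append(f"{c.name}: no {resource} limit set")
        if request is not None and limit is not None and request > limit:
            problems.append(f"{c.name}: {resource} request exceeds limit")
    return problems


print(audit(ContainerResources("api", 2, 2, 2, 2)))   # []
print(len(audit(ContainerResources("batch"))))        # 4
```

A check like this only catches missing or contradictory settings; choosing values that match real usage still depends on measuring the workload.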

Location Selection

Endpoint location selection and optimization is perhaps the most significant issue with respect to the hard costs associated with distributed hosting. The idea behind distributed deployment is to move workloads closer to users to maximize performance by minimizing latency, as well as to improve reliability and resilience through high availability. 

The first step in this process is to decide which endpoints you need to satisfy your performance or latency criteria.  

  • How many do you need?
  • Where are they located? 
  • Who is the provider? 
  • Are all resource allocations duplicated across regions? 
  • How is failover handled? 
  • And most importantly, are workloads always running in those regions or on demand? 

Addressing these questions by manually selecting endpoint locations and settling on a fixed delivery network invariably ends up with a sub-optimal balance of cost versus performance–either end-users are underserved or selected compute and locations are underutilized. 

Egress Cost

Often, one of the greater costs to consider (especially when using one of the hyperscalers) comes from data egress charges. Microservice architectures, by nature and design, tend to be very chatty. When you have a multi-AZ cluster architecture (whether single or multi-region), you end up with a lot of inter-AZ traffic, and those data egress costs quickly add up. 

While there’s a lot that can be done to optimize and right-size the capacity and resource requirements, it’s incredibly difficult to optimize egress cost. In fact, this is so difficult that we know of organizations that have specifically sought to minimize hyperscaler egress cost by deploying single-AZ clusters globally (i.e., each cluster is independent and isolated) and then load-balancing traffic as needed. That sort of “distributed” approach is functional, but has significant tradeoffs in terms of increased endpoint, traffic and operations management, and lacks the fundamental benefits of a distributed multi-cluster architecture.

Controlling Costs With a Clusterless Paradigm

A simple yet elegant answer to these cost issues is to adopt clusterless technology. A clusterless paradigm is the concept of running a coordinated “cluster of clusters” – in other words, treating a distributed multi-cluster deployment as though it were effectively one large cluster. This type of deployment, combined with multi-tenancy, and dynamic, location-aware systems on top of global networks, makes it possible for hosting providers to offer several key benefits in terms of cost control.

First, when a multitude of clusters is presented to the dev and operations teams as though it were a single Kubernetes cluster, this eliminates virtually all management complexity involved in distributed deployments. Teams no longer have to worry about individual location monitoring and telemetry collection, multi-cluster management, multiple deploy pipelines, network operations, load balancing, etc.–it’s all just available as needed, managed by the provider.  Importantly, a clusterless system can deliver automated failover across multiple providers so that resources scale, route or balance as needed when any single provider has issues. 

Second, in a clusterless system, it is much easier to definitively control resource cost when the control mechanism is centralized and consistent (i.e., a standard Kubernetes structure) across all target deploy locations. There is no need to monitor and manage resource utilization rules and parameters in individual systems or locations. Limits on individual pod resource utilization and on horizontal and vertical scaling can be set in a simple, centralized fashion. Indeed, limits on the overall count of pods across the distributed system can also be set in the same way, as though the distributed system were one standard Kubernetes cluster.

It also becomes possible to use policy instructions to specify developer and operations teams’ intent for the entire system in broad terms, thus managing their desired outcomes for cost versus performance, security or compliance. This controls costs both directly (through resource controls, etc.) and indirectly (by shifting cost left/earlier in the DevOps process). 

Third, dynamic, location-aware placement automation allows for adaptive endpoint selection. As workloads scale or demand changes geographically, the system responds by spinning endpoints up or down on a global (and local) basis, then moving workloads around intelligently to run only (and at least) in the right place at the right time rather than in all places at all times. This is a significant win for distribution cost management. For instance, a policy statement such as “run containers in regions when there are at least 20 HTTP requests per second emanating from that region” becomes a resource request instruction to the underlying clusterless system to deploy the workload only where and when that criterion is met. Resources are not consumed in regions where there is insufficient traffic, but regions with sufficient end-user demand are served by workloads placed close by.
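The decision behind the example policy can be sketched as a simple threshold filter. This is a minimal illustration in Python, not the policy language or API of any particular clusterless platform; the function name, the region names and the traffic figures are all made up for the example.

```python
from typing import Dict, List


def regions_to_run(requests_per_second: Dict[str, float], min_rps: float = 20.0) -> List[str]:
    """Regions whose observed HTTP request rate meets the policy threshold.

    Mirrors the policy "run containers in regions when there are at least
    20 HTTP requests per second emanating from that region".
    """
    return sorted(
        region for region, rps in requests_per_second.items() if rps >= min_rps
    )


# Hypothetical traffic snapshot (requests per second by originating region).
observed = {"eu-west": 140.0, "us-east": 20.0, "ap-south": 3.5}
print(regions_to_run(observed))  # ['eu-west', 'us-east']
```

A real system would re-evaluate this continuously and would likely add damping so that workloads are not started and stopped every time traffic hovers around the threshold.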

Finally, egress cost can be better managed, not only by keeping more data and traffic out at the edge endpoints rather than buried in hyperscaler networks, but also by avoiding the placement of workloads in high egress cost systems (such as the aforementioned hyperscaler networks).  With the simplicity of a clusterless system, lower-cost compute locations can deliver powerful distributed outcomes.

A Cloud-Native Mindset 

Ultimately, building a distributed cloud native architecture requires a different mindset than we’re used to in a traditional data center environment, especially when looking at deployment through a cost lens. Teams need to give careful consideration to TCO as a factor in thinking through deployment architectures, as deploying to distributed hyperscalers can become very expensive, very quickly–not only in terms of hard compute and egress costs but also in terms of the potentially greater and more critical costs associated with team and time to market.

Different architectural approaches, including clusterless, better deliver on the cloud-native promise of running containers and apps where and when needed without breaking the budget. Those willing to shift their perspective in approaching distributed deployments will benefit.


To hear more about cloud-native topics, join the Cloud Native Computing Foundation and the cloud-native community at KubeCon+CloudNativeCon North America 2022 – October 24-28, 2022.
