The challenge for Kubernetes (K8s) has been its complexity coupled with the lack of in-house skills to provision and troubleshoot Kubernetes clusters in the cloud or on-premises. While K8s adoption is consistently growing, many developers and operations teams are still relatively new to the technology, having to learn new terms, workflows, tools and more.
K8s’ discrete parts require a broad level of expertise, even just in the setup process. Consider that spinning up a K8s cluster requires knowledge about and configuration of multiple components from pods to services, not to mention resources like etcd, API servers, kubelet and kube-proxy.
Then there’s scheduling, scaling and networking to contend with. One misstep can quickly translate into myriad issues with scalability, reliability, even security.
Further, the ecosystem itself is growing and evolving rapidly. The sheer number of tools and add-ons can overwhelm beginners and is hard to keep up with. Formal education and training in Kubernetes is not necessarily part of every developer’s background, academically or professionally.
And we can’t forget that the technology, with many moving parts and complex interactions, can be difficult and time-consuming to troubleshoot when failures occur. Diagnosing the causes of the failures requires in-depth technical knowledge and expertise that often live inside a few experienced engineers’ heads.
Let’s dive deeper and explore new and innovative ways that can help overcome the glaring skills gap problem.
Upskill Your Team with ChatGPT Assistance
Kubernetes is challenging to learn and use effectively because there’s no one-size-fits-all approach to it. K8s is highly customizable and can be configured in many different ways, depending on the specific needs of your application or infrastructure. It’s often difficult to apply what you’ve learned from documentation (and there is plenty) and training to an existing environment because teams lack contextual understanding and visibility of their architecture.
What does the current architecture look like? What pods are tied to a specific namespace? What is the health of the nodes? Asking basic questions of our environment requires context switching between the AWS console, the kubectl command line, Terraform configuration files and monitoring tools.
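To give a sense of what answering just one of these questions by hand involves, here is a minimal Python sketch that summarizes node health. It assumes you have already captured the output of `kubectl get nodes -o json`; the embedded sample below is hypothetical and trimmed to the fields the function reads, and the function name is ours for illustration.

```python
import json

# Hypothetical, trimmed sample shaped like `kubectl get nodes -o json` output.
SAMPLE_NODES_JSON = """
{
  "items": [
    {"metadata": {"name": "node1"},
     "status": {"conditions": [
       {"type": "MemoryPressure", "status": "False"},
       {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node2"},
     "status": {"conditions": [
       {"type": "Ready", "status": "False"}]}}
  ]
}
"""


def node_readiness(nodes_json: str) -> dict[str, bool]:
    """Map each node name to True if its Ready condition has status "True"."""
    readiness = {}
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        readiness[node["metadata"]["name"]] = ready
    return readiness


if __name__ == "__main__":
    for name, ready in node_readiness(SAMPLE_NODES_JSON).items():
        print(f"{name}: {'Ready' if ready else 'NotReady'}")
```

Even this small check requires knowing where readiness lives in the node object, and it covers only one of the questions above.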
What if we could ask ChatGPT these questions?
Let’s look at an example of using PromptOps, powered by ChatGPT, to understand all the deployments within a cluster. PromptOps offers a free Kubernetes consulting tool that lets users ask questions and get immediate assistance in the form of Bash scripts, documentation references and other helpful resources.
By giving PromptOps fragmented data about our infrastructure from different sources, such as Confluence, Notion, Terraform config files and more, we expected it to quickly aggregate all the information and help visualize the architecture. Instead of manually running kubectl commands to check the deployments, we asked PromptOps in a chat to describe which deployments are present in our cluster.
Here is PromptOps’ response:
PromptOps > I suggest running this script to answer your question.
PromptOps gave a script to run along with a reference to a Notion page which contains information about the cluster.
Upon executing the script, PromptOps provided the output of the deployments within the cluster.
PromptOps also provided a visual graph of all the deployments in the cluster to make learning much easier.
The response from PromptOps simplifies Kubernetes infrastructure management, empowering everyone on the DevOps team to get up to speed on current architecture and keep up with changes.
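The script PromptOps returned is not reproduced here. As a rough, hypothetical illustration of the kind of check involved, the Python sketch below groups deployments by namespace and reports ready versus desired replicas. It assumes the input is the output of `kubectl get deployments --all-namespaces -o json`; the sample data and the function name are invented for the example, and this is not PromptOps’ own script.

```python
import json
from collections import defaultdict

# Hypothetical, trimmed sample shaped like
# `kubectl get deployments --all-namespaces -o json` output.
SAMPLE_DEPLOYMENTS_JSON = """
{
  "items": [
    {"metadata": {"name": "web", "namespace": "default"},
     "spec": {"replicas": 3},
     "status": {"readyReplicas": 3}},
    {"metadata": {"name": "api", "namespace": "payments"},
     "spec": {"replicas": 2},
     "status": {}}
  ]
}
"""


def summarize_deployments(deployments_json: str) -> dict[str, list[tuple[str, int, int]]]:
    """Group deployments by namespace as (name, ready, desired) tuples."""
    by_namespace = defaultdict(list)
    for item in json.loads(deployments_json)["items"]:
        meta = item["metadata"]
        desired = item.get("spec", {}).get("replicas", 1)
        # readyReplicas is omitted from status when no replicas are ready.
        ready = item.get("status", {}).get("readyReplicas", 0)
        by_namespace[meta["namespace"]].append((meta["name"], ready, desired))
    return dict(by_namespace)


if __name__ == "__main__":
    for namespace, deployments in summarize_deployments(SAMPLE_DEPLOYMENTS_JSON).items():
        print(namespace)
        for name, ready, desired in deployments:
            print(f"  {name}: {ready}/{desired} ready")
```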
Save Time with Automated Incident Investigation
Let’s look at troubleshooting production issues and see how PromptOps can help identify the root cause of failures in Kubernetes. Kubernetes metrics and logs offer critical insight across all levels from the cluster down to the container. They provide information on resource utilization, pod and node status, cluster autoscaler metrics, network errors, container runtimes and more.
But the issue with all these metrics is that they’re not immediately useful. A log stream or metric chart doesn’t provide the necessary context or direction. There are tools to help, but they often make things more complicated, requiring teams to pull context from across the whole tech stack just to see what really happened or who did what when issues arise.
At CtrlStack, we’ve been incorporating PromptOps to analyze causal graphs, Kubernetes events, metrics and logs to reduce the challenges of manual troubleshooting. The goal is to help DevOps teams quickly understand the Kubernetes issues by using PromptOps to generate root cause analysis summaries that explain the cause. In this example, a causal graph of the root cause analysis dashboard within CtrlStack is input to PromptOps for troubleshooting and resolution as described below.
Here is an example of a causal graph from the root cause analysis dashboard.
With this, we prompted PromptOps to explain the graph. Below is PromptOps’ response:
PromptOps > Here is the explanation of the graph:
- The graph is a visualization of the root cause of an error in a Kubernetes pod. The error is caused by changes in the pod’s image, which are similar to changes in the Kubernetes pod_error metric. The error is also similar to the inflection points found in the kubernetes_pod_mem_usage metrics, and to the topology changes in node1.
- The graph is showing the root cause of a kubernetes_pod_error change. There are three insights. The first is that the image has changed. The second is that there are inflection points found in the metrics kubernetes_pod_mem_usage. The third is that there is a topology change for node1.
The response from PromptOps provided helpful correlation of information across uptime and memory usage metrics as well as related topology changes. These insights involved checking the Kubernetes change events, metrics, resource usage, and topology changes. Based on this automated incident investigation, developers and operators should have sufficient context to quickly determine next steps to fix the problem.
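The correlation described above can be illustrated with a deliberately simple Python sketch: find the sharpest rise in a memory usage series, then list the change events that happened shortly before it. This is not how CtrlStack or PromptOps compute their insights; the "largest jump" heuristic, the sample data and all names below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    timestamp: int  # seconds since the start of the observation window
    description: str


def largest_jump(samples: list[tuple[int, float]]) -> int:
    """Return the timestamp of the sample that follows the largest increase."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    best_ts = samples[1][0]
    best_delta = samples[1][1] - samples[0][1]
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur - prev > best_delta:
            best_ts, best_delta = ts, cur - prev
    return best_ts


def events_before(events: list[ChangeEvent], ts: int, window: int) -> list[ChangeEvent]:
    """Return events that occurred within `window` seconds before `ts`."""
    return [e for e in events if ts - window <= e.timestamp <= ts]


# Hypothetical memory usage samples (timestamp in seconds, usage in MiB).
MEM_USAGE = [(0, 200.0), (60, 205.0), (120, 210.0), (180, 480.0), (240, 500.0)]

# Hypothetical change events on the same timeline.
EVENTS = [
    ChangeEvent(20, "config map updated"),
    ChangeEvent(150, "pod image changed"),
    ChangeEvent(170, "topology change on node1"),
]

if __name__ == "__main__":
    jump_ts = largest_jump(MEM_USAGE)
    print(f"Sharpest memory increase observed at t={jump_ts}s")
    for event in events_before(EVENTS, jump_ts, window=60):
        print(f"  candidate cause at t={event.timestamp}s: {event.description}")
```

A real system would need more robust change detection and would weigh far more signals, but the sketch shows why lining up change events against metric history narrows the search.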
Combining ChatGPT with Change Intelligence for Closing the Skills Gap
Based on the examples provided, it’s apparent that ChatGPT can provide valuable assistance to close the Kubernetes skills gap. ChatGPT offers fast insights and clear explanations for DevOps teams to troubleshoot and fix production issues. This empowers junior operators as well as developers who are new to Kubernetes to learn the technology and resolve common issues independently.
While ChatGPT’s responses can provide a quick understanding of the problem, it requires contextual information specific to different issues of Kubernetes deployments. That’s where change intelligence comes in. Change intelligence platforms provide the causal graphs that connect resource capacity, infrastructure changes, configuration changes, metrics history charts and an event timeline to optimize the path to root cause analysis.
A ChatGPT-based approach to learning Kubernetes has the potential to significantly improve DevOps productivity while reducing cognitive overload. By combining ChatGPT with change intelligence, teams can uplevel their Kubernetes skills and gain better observability.