Achieving Distributed High Availability with AKS Hybrid: Hands-on PoC

Azure Stack HCI (hyperconverged infrastructure) hosts virtualized Windows and Linux workloads and their storage in a hybrid environment that combines on-premises infrastructure with Azure cloud services. AKS Hybrid is an on-premises implementation of the Azure Kubernetes Service (AKS) orchestrator, which automates running containerised applications at scale; it runs on Windows Server with Hyper-V and on the Stack HCI platform. Together they provide a solution for hosting highly available workloads on-premises. Azure Arc is a cloud-based control plane that can be used to manage on-premises AKS Hybrid instances.

Flux is an open-source set of continuous delivery solutions for Kubernetes. AKS and AKS Hybrid natively support Flux through their GitOps capabilities. GitOps is a modern approach to managing and automating the deployment and operation of software applications and infrastructure, using Git as the source of truth. Git is a distributed version control system that tracks changes in source code and facilitates collaboration among developers.

This is the second part of a two-part story; you can find part one here. This article and its associated links walk you through the process of creating an AKS Hybrid PoC using a real-world use case.

Introduction

You may remember from part one that we worked with a customer to create a proposed solution that addressed their requirement for a resilient, flexible, next-generation architecture to support their call centre triage service.

Once you have a proposed architecture, the next logical step for most customers is to test the hypothesis with a proof of concept (PoC). A PoC not only gives a customer the opportunity to test the validity of all or part of their idea before committing to a full implementation, it also provides a deep learning exercise for the implementation and support teams in what may be an unfamiliar technology. Customers often ask the Microsoft team to work alongside them during a PoC, which can accelerate the process, condensing weeks or months of work into as little as 3 days.

Successful PoCs are usually tightly scoped, with success criteria defined up-front. Proving an idea valid and proving it invalid are both successful PoC outcomes.

Challenges and Requirements

Although we had worked with the customer’s team to create a hypothetical solution, there were still unknowns that would add risk to any decision to commit to the design. This is PoC territory.

We worked closely with the customer’s architects to define the areas of uncertainty: which parts of the architecture they were unsure of, which needed greater understanding, which would answer important questions, and which needed to be demonstrated to get buy-in from other stakeholders.

For the PoC we arrived at the following list of MUST-HAVEs:

  • A simulated supplier and cloud environment.
  • AKS Hybrid deployed on the supplier environment, and AKS on the cloud environment, with a clear understanding gained of what would be required from the supplier.
  • Both clusters centrally managed, and with running workloads.
  • An understanding of release and versioning for the workloads.
  • An understanding of the consistency of security, logging, monitoring, troubleshooting capabilities across cloud and hybrid solutions.

And the following list of STRETCH capabilities:

  • GitOps demonstrating new workload deployment.
  • Demonstrable end-to-end messaging from supplier application to cloud data store.

Sounds like a lot for 3 days, right?

Figure: High-level PoC Overview

Proposed Solution

You can find the resources to follow along with the PoC implementation here.

What we Learned

The architecture worked as designed, supporting centrally located BAU operations that could manage both cloud-based and on-premises workloads.

As a reminder, the stated high-level requirements for the solution were to support:

  • A simplified supplier operation to address the variability in supplier capabilities,
  • Possible loss of connectivity between on-premises and cloud,
  • Deployment options to suit suppliers’ varied landscape, including VMware.

The PoC showed that standard Git repositories and CI/CD pipelines, together with AKS GitOps, could be used to manage a gated API lifecycle effectively and at the scale required (50+ clusters). This dramatically reduced the burden on suppliers and gave the customer the ability to manage and maintain their solution with a high degree of control, insight, and rigor.
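To make the idea of Git as the source of truth more concrete, here is a minimal conceptual sketch in Python of the comparison a GitOps agent performs: desired state (what Git declares) against actual state (what a cluster is running). It is an illustration only and not how Flux is implemented. The function name, the workload names, the versions, and the cluster name are all hypothetical.

```python
def plan_reconcile(desired: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Work out what must change for a cluster to match Git.

    Both arguments map a workload name to a version string.
    `desired` stands in for the state declared in the Git repository,
    `actual` for what one cluster currently runs (both hypothetical).
    """
    return {
        # Declared in Git but missing from the cluster.
        "install": sorted(n for n in desired if n not in actual),
        # Present on the cluster but at a different version from Git.
        "upgrade": sorted(n for n in desired if n in actual and actual[n] != desired[n]),
        # Running on the cluster but no longer declared in Git.
        "remove": sorted(n for n in actual if n not in desired),
    }


if __name__ == "__main__":
    # Hypothetical example data, not taken from the PoC.
    git_state = {"triage-api": "1.4.0", "message-relay": "2.1.0"}
    supplier_cluster = {"triage-api": "1.3.2", "legacy-agent": "0.9.0"}
    print(plan_reconcile(git_state, supplier_cluster))
```

Because every cluster is compared against the same Git state, the same check applies unchanged whether there are two clusters or fifty, which is what makes the approach attractive at scale.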

It was important for the customer to understand what requirements the solution would place on the suppliers who would host the on-premises AKS Hybrid deployment. By walking through the deployment process, our PoC was able to provide clarity on this and produce a bulleted list of the hardware, software, and configuration required on the supplier side.

We were able to discuss security aspects of the solution, including the use of Defender to secure the containers, and create a list of topics, such as logging and troubleshooting, to look at in further depth at subsequent sessions.

The PoC demonstrated that the solution would continue to function within expected parameters when cloud connectivity was dropped. Through the use of resiliency and reliability patterns, core messaging would be interrupted during an outage but would resume once connectivity was restored, with no loss of data.
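The article does not name the specific resiliency patterns used, so the following is only a sketch of one common candidate, store-and-forward, written in Python. Messages are queued locally and removed only after a successful send, so an outage delays delivery without dropping anything. The class name and the `send` callback are hypothetical, and the queue here is in memory; a real deployment would presumably need durable storage to survive a restart.

```python
from collections import deque
from typing import Callable


class StoreAndForwardQueue:
    """Buffer messages locally and forward them, in order, when the link is up."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is a hypothetical callback that raises ConnectionError
        # when the cloud endpoint cannot be reached.
        self._send = send
        self._pending: deque[str] = deque()

    def publish(self, message: str) -> None:
        """Store the message first, then attempt delivery."""
        self._pending.append(message)
        self.flush()

    def flush(self) -> int:
        """Send pending messages oldest-first; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                # Link is down: keep the message and try again later.
                break
            # Remove the message only after the send succeeded.
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        """Number of messages still waiting to be delivered."""
        return len(self._pending)
```

Calling `flush()` on a timer, or whenever connectivity is detected again, drains the backlog in the original order.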

AKS Hybrid can be deployed to Stack HCI and Windows Server, with VMware currently in Private Preview, providing an excellent choice of deployment options.

While AKS Hybrid is an on-premises instance of AKS, some dependencies on the cloud remain, including log shipment, monitoring, API management, and billing. This was already known, and the normal mode of operation will be cloud connected. However, it was important to understand the limits of the product. These vary with the deployment platform: when deployed on Stack HCI, the platform must connect to Azure at least once every 30 days to remain operational (see here). The limits are less distinct on other platforms such as Windows Server, but prolonged disconnection can eventually result in undesired drift in AKS. Fortunately, support for 30 days offline was adequate in this use case.
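An operations team might want to track how much of that 30-day window remains for each disconnected site. The Python sketch below shows the arithmetic; the 30-day limit comes from the text above, while the function names and the 7-day warning threshold are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Stack HCI must connect to Azure at least once every 30 days (per the article).
MAX_OFFLINE = timedelta(days=30)


def offline_budget_remaining(last_sync: datetime, now: datetime) -> timedelta:
    """Time left before the 30-day window since the last Azure sync expires."""
    return MAX_OFFLINE - (now - last_sync)


def needs_attention(
    last_sync: datetime,
    now: datetime,
    warn_within: timedelta = timedelta(days=7),  # hypothetical threshold
) -> bool:
    """True when the remaining budget is at or below the warning threshold."""
    return offline_budget_remaining(last_sync, now) <= warn_within


if __name__ == "__main__":
    # Hypothetical dates: last sync on 1 January, checked on 25 January.
    last = datetime(2024, 1, 1, tzinfo=timezone.utc)
    check = datetime(2024, 1, 25, tzinfo=timezone.utc)
    print(offline_budget_remaining(last, check), needs_attention(last, check))
```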

In Conclusion

The PoC was successful in creating a demonstrable implementation of the specific customer use case, which they could take forward both as a basis for ongoing experimentation and an MVP, and as a demonstration environment. It allowed the implementation and support teams to upskill quickly in a safe environment, and encouraged enthusiasm and curiosity around the platform. Better still, it raised questions that had not been considered before, which we were able to answer together, de-risking any future full-scale implementation.
