Reducing Kubernetes Downtime by 80% via Maintenance Automation
Urban Dynamics takes the confidential nature of its work for all clients, from enterprises to startups, very seriously. As such, this is an anonymized case study of Kubernetes work done for a client in 2023 and 2024. We have not altered any of the technical details below, only omitted the business identity and context in which it was done.
In 2023, Urban Dynamics was engaged by an enterprise client regarding downtime challenges with their on-premises Kubernetes clusters. They had chosen Kubernetes as their infrastructure platform to handle the critical systems on which their revenue depended as they grew. While Kubernetes is a great infrastructure platform for these critical workloads, any downtime of a Kubernetes cluster causes downtime for the services that run on it. One way a Kubernetes cluster can experience downtime is through irregular manual maintenance, which is what our client was facing.
Through Ansible automation and infrastructure-as-code, Urban Dynamics was able to reduce mean-time-to-repair (MTTR) for the Kubernetes clusters by 80% and, by extension, improve uptime for all the workloads that ran on them. This translated to improved uptime for customers and a proportional reduction in lost revenue from system downtime. This case study covers the business challenges caused by this problem, the solution we implemented, and the impact of that solution.
If you're interested in learning more about Kubernetes, refer to the Appendix: What is Kubernetes section at the bottom of this case study.
Business Challenge
As our client expanded their reach and complexity with increasingly ambitious customer experiences, they encountered operational challenges with their Kubernetes clusters. Kubernetes can generally be run in one of three ways: fully-managed, where it is provided as a service operated by a third party; partially-managed, where a third party provides tooling or services to aid in operation; and self-managed, where end-to-end operation is done without third-party assistance.
This enterprise client had chosen to run partially-managed Kubernetes clusters on-premises. Examples of partially-managed Kubernetes solutions include Google Kubernetes Engine (GKE) and Rancher, which provide tooling and automation that reduce the complexity of maintaining Kubernetes. This allows businesses to focus on building the services that support their ever-growing business-critical software footprint while still retaining full control of their Kubernetes clusters.
However, partially-managed Kubernetes clusters generally require operators to define and implement their own maintenance and upkeep processes. This is intentionally left as an exercise for the businesses using the clusters, because it gives them maximum control. If they didn't want this control, they would have chosen a fully-managed solution instead.
The overhead of these maintenance and upkeep processes had led to prolonged mean-time-to-repair (MTTR) for our client during incidents and consumed extensive engineering hours for new production deployments. Their DevOps team, committed to delivering seamless user experiences and ensuring system stability across their various sites, needed a solution to streamline their infrastructure management and deployment processes.
Solution
Urban Dynamics partnered with this client's DevOps group to address these challenges through automation and infrastructure optimization. By leveraging Ansible — an open source IT automation engine — we automated Kubernetes environments and associated resources, enabling our client to transition to infrastructure-as-code wherever feasible. Comprehensive procedures for cluster upkeep were developed, accompanied by thorough tooling and process documentation.
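To illustrate the idea behind infrastructure-as-code in general terms, here is a minimal Python sketch of desired-state reconciliation: the desired state lives in a definition that can be version controlled, a plan is computed against what is actually running, and applying it a second time changes nothing. This is not the client's tooling and not Ansible's API. The function names (plan, converge) and the resource names are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

Spec = Dict[str, str]  # hypothetical resource settings, e.g. {"replicas": "3"}


def plan(desired: Dict[str, Spec], actual: Dict[str, Spec]) -> List[Tuple[str, str]]:
    """List the actions needed to make `actual` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # untracked resource
    return actions


def converge(desired: Dict[str, Spec], actual: Dict[str, Spec]) -> Dict[str, Spec]:
    """Return the state after carrying out the plan (in-memory stand-in only)."""
    return {name: dict(spec) for name, spec in desired.items()}


# Hypothetical example data, not taken from the engagement.
desired = {"ingress": {"replicas": "3"}, "api": {"replicas": "2"}}
actual = {"api": {"replicas": "1"}, "legacy-job": {"replicas": "1"}}

first_plan = plan(desired, actual)    # work to do on the first run
actual = converge(desired, actual)
second_plan = plan(desired, actual)   # empty list: re-running is safe
```

The property worth noticing is the empty second plan. Because a run that finds nothing to change does nothing, the same definition can be applied repeatedly during routine maintenance or incident recovery without side effects.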
We scaled out the use of infrastructure-as-code from a handful of production services to nearly all of them. Given the fast-moving nature of their business, they needed a robust way to roll out and roll back production changes with high safety guarantees. Through our work with them, they were able to automate Kubernetes deployment, cluster management, and more to satisfy this need. To improve the safety of production rollouts, we introduced a blue/green deployment model, which greatly reduced the risk of production deployments.
To validate widespread changes to customer experiences before those changes reach the public, we built an automated continuous integration and deployment pipeline that deploys changes to these services into a sandbox environment, where they can be tested and validated on-site at the client's locations.
Additionally, to support our client's software developers, we built a replica of the production environment as part of an ongoing effort to bring untracked infrastructure under infrastructure-as-code (IaC). Developers could use this replica as a service backend for development without having to spend resources standing up complex production-like environments themselves.
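For readers unfamiliar with the blue/green model, the following Python sketch shows the general mechanism under simplified assumptions: two identical environments exist, only one receives live traffic, a new version is deployed to the idle one, and traffic is switched only after a health check passes. The class, its methods, and the version strings are hypothetical and do not represent the client's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy model: two environments, exactly one of which serves live traffic."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"}
    )
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, new_version: str, health_check: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment; switch traffic only if it is healthy."""
        target = self.idle
        self.versions[target] = new_version  # the live environment is untouched
        if not health_check(new_version):
            return False                     # traffic never moved
        self.live = target                   # single cut-over step
        return True

    def rollback(self) -> None:
        """Send traffic back to the other environment, which still runs the old version."""
        self.live = self.idle


router = BlueGreenRouter()
released = router.release("v2", health_check=lambda version: True)
live_after_release = router.live  # traffic now goes to the environment running v2
router.rollback()
live_after_rollback = router.live  # traffic is back on the environment running v1
```

The risk reduction comes from two properties visible in the sketch: a failed health check leaves live traffic where it was, and a rollback is a routing change rather than a redeployment.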
Impact
As a result of working with Urban Dynamics, our client was able to improve their existing infrastructure and drive changes to the core customer experience with much lower risk, increased deployment speed, and minimal internal engineering effort or change. The DevOps team benefited from greatly reduced engineering hours for Kubernetes cluster maintenance, which enabled them to focus their efforts on supporting new projects and other revenue-critical priorities.
The implementation of automation and infrastructure-as-code for their production systems markedly improved operational efficiency and reliability. The mean-time-to-repair (MTTR) for the Kubernetes clusters was reduced by 80%, enabling swift resolution of incidents and minimizing disruption to customers. Moreover, the engineering hours required for new production deployments were halved, allowing teams to focus more on innovation and product development.
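To make the relationship between MTTR and downtime concrete, here is a small worked example in Python. The incident count and the starting MTTR are hypothetical figures chosen only for illustration. They are not client data. The sketch also assumes that the number of incidents per year stays the same, so that total downtime scales directly with MTTR.

```python
def reduced_mttr(mttr_hours: float, reduction_percent: float) -> float:
    """MTTR after applying a percentage reduction."""
    return mttr_hours * (100 - reduction_percent) / 100


def yearly_downtime(incidents_per_year: float, mttr_hours: float) -> float:
    """Total downtime per year, assuming each incident lasts the MTTR on average."""
    return incidents_per_year * mttr_hours


# Hypothetical inputs, for illustration only.
incidents_per_year = 10
mttr_before = 5.0                              # hours per incident
mttr_after = reduced_mttr(mttr_before, 80)     # 80% reduction

downtime_before = yearly_downtime(incidents_per_year, mttr_before)
downtime_after = yearly_downtime(incidents_per_year, mttr_after)
downtime_reduction_percent = 100 * (downtime_before - downtime_after) / downtime_before
```

Under these assumptions, an 80% reduction in MTTR produces the same 80% reduction in total downtime. If incident frequency also changed, the two percentages would diverge.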
Conclusion
By embracing automation and infrastructure-as-code, this client, in collaboration with Urban Dynamics, was able to implement robust protocols and tools for Kubernetes cluster maintenance, significantly improving cluster stability and, in turn, their customer experience. The solution was also extended to benefit the software developers whose work ran on these Kubernetes clusters. This not only optimized their operational efficiency but also directly served their commitment to delivering extraordinary experiences to customers worldwide. These results are a testament to the power of sound technology processes and automation to improve the reliability and agility of enterprise software systems.
If you have questions or would like to discuss this case study further, please feel free to reach out to us at Urban Dynamics. We're always excited to talk with others who are interested in or passionate about these topics!
Appendix: What is Kubernetes
Kubernetes is a natural fit for critical workloads as it is an open source system for automating deployment, scaling, and management of containerized applications. It is an infrastructure platform that helps businesses keep their important online systems running smoothly, even under heavy demand. Think of it like an automated traffic manager for a website or app. When customers interact with these systems - whether making purchases, checking accounts, or using services - Kubernetes ensures everything stays up and running, even if there's a sudden surge in activity.
If you're interested in reading more about Kubernetes, see the official website: kubernetes.io