Author: Mandi Walls, DevOps Advocate, PagerDuty
Bio: Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps technology organizations increase their effectiveness with modern IT practices and unplanned incidents. She is a regular speaker at technical conferences, and is the author of the whitepaper “Building a DevOps Culture” published by O’Reilly. She is interested in the emergence of new tools and workflows to make the task of operating large complex computing systems more approachable.
Kubernetes has become ubiquitous in the enterprise software world, with 79% of organizations now using it to power their applications. As organizations adopt Kubernetes, however, they’ve also had to rapidly learn new tools and processes to manage and monitor their container ecosystems.
To ease that learning curve, organizations often find two techniques invaluable for incident response: implementing a service ownership approach, and automating processes to reduce complexity for first responders. Here’s why these techniques can be so valuable.
Service ownership for issue identification
As with virtually any enterprise software, it’s inevitable that applications powered by Kubernetes will suffer outages or slowdowns. However, the complexity of Kubernetes poses a problem for incident response teams, who often need support from subject matter experts when problems strike.
To address this challenge, organizations need to implement a service ownership approach, which records the developers and engineers responsible for each Kubernetes environment or container cluster. When a Kubernetes environment or application suffers an incident, service ownership means that the team members best qualified to remediate it can be brought in to triage. By embedding a “code it, own it” mindset into Kubernetes management, service ownership reduces the complexity faced by incident responders who are less familiar with specific applications or services.
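As a rough illustration, an ownership record can be as simple as a lookup from a workload to the team that built it. The sketch below is a minimal example only: the namespaces, workload names, team names, and the find_owner helper are all hypothetical, and a real organization would more likely keep this data in a service catalog or incident management platform than in code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Owner:
    team: str      # team that builds and runs the service
    on_call: str   # hypothetical on-call rotation to notify


# Hypothetical registry keyed by (namespace, workload name).
SERVICE_OWNERS: Dict[Tuple[str, str], Owner] = {
    ("payments", "checkout-api"): Owner("payments-eng", "payments-oncall"),
    ("search", "indexer"): Owner("search-eng", "search-oncall"),
}

# Used when nobody has claimed a workload, so an incident is never dropped.
FALLBACK_OWNER = Owner("incident-response", "ir-oncall")


def find_owner(namespace: str, workload: str) -> Owner:
    """Return the recorded owner for a workload, or the fallback team."""
    return SERVICE_OWNERS.get((namespace, workload), FALLBACK_OWNER)
```

With a record like this in place, an alert for checkout-api in the payments namespace can go straight to the payments team, while a workload with no recorded owner falls back to the general incident response rotation.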
There are significant benefits to adopting a service ownership approach. First, it creates a far better experience for customers, because it gives developers greater insight into how their code is operating in production. Second, service ownership helps improve quality control by reducing the time spent discovering which team member is responsible for what. Finally, service ownership reduces the number of people needed to troubleshoot and analyze a problem. Combined with the expertise of the service owner, this significantly reduces mean time to resolution (MTTR) for incidents.
Service ownership requires a cultural shift, significant organizational buy-in, and a willingness to break down the silos that separate developers from operations and incident teams. But once implemented, service ownership offers the best approach to minimizing the complexity of Kubernetes incident response.
Process automation for rapid response
When an incident strikes a Kubernetes environment, most first responders won’t have the expertise or time to determine its root cause. As a result, responders often need to bring in engineers and developers to help diagnose the problem.
The diagnostic phase is often the lengthiest part of the incident response process, taking up to 85% of the workflow’s time. Diagnostics can often involve at least four engineers running through standardized and repeatable steps like CPU and memory checks, or reviewing recent code commits. This use of engineering resources is expensive and represents a significant opportunity cost: every escalation means less time spent by engineers on delivering innovation for an organization.
However, because many of these diagnostic workflows are standardized and highly repeatable, they’re also ideal candidates for process automation.
Process automation means turning diagnostic workflows into runbooks that execute these tasks automatically on command. Once developed, libraries of defined runbooks can be given to incident response teams, who can trigger them without having to ask for engineering assistance. Remediation can be automated as well as diagnostics, including repetitive tasks such as restarting servers or clearing memory caches.
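To make the idea concrete, here is a minimal sketch of a diagnostic runbook written as a short script. It assumes kubectl is installed and pointed at the affected cluster, and that the cluster runs metrics-server (which kubectl top depends on). The function names and the choice of steps are illustrative assumptions, not a description of any particular product.

```python
import subprocess
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, List[str]]  # (human-readable label, command to run)


def diagnostic_runbook(namespace: str, deployment: str) -> List[Step]:
    """Hypothetical runbook: the checks an engineer would otherwise run by hand."""
    return [
        ("CPU and memory by pod",
         ["kubectl", "top", "pods", "-n", namespace]),
        ("Recent events",
         ["kubectl", "get", "events", "-n", namespace,
          "--sort-by=.lastTimestamp"]),
        ("Rollout history",
         ["kubectl", "rollout", "history",
          f"deployment/{deployment}", "-n", namespace]),
    ]


def run_command(cmd: List[str]) -> str:
    """Run one command and return its output, or a failure note."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return f"FAILED: {result.stderr.strip()}"
    return result.stdout


def run_runbook(steps: List[Step],
                execute: Callable[[List[str]], str] = run_command) -> Dict[str, str]:
    """Run every step in order; one failing step never stops the rest."""
    report: Dict[str, str] = {}
    for label, cmd in steps:
        try:
            report[label] = execute(cmd)
        except Exception as exc:  # e.g. kubectl missing, or a timeout
            report[label] = f"FAILED: {exc}"
    return report
```

A first responder could then call run_runbook(diagnostic_runbook("payments", "checkout-api")) and attach the resulting report to the incident before deciding whether to escalate. The execute parameter exists so the runbook can be rehearsed against a stub instead of a live cluster.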
When empowered with process automation, incident response teams can handle more incidents in a Kubernetes environment while reducing the number of interventions needed from engineers. This results in shorter resolution times, fewer disruptive escalations, and more time spent by engineering teams on delivering innovation.
Leveraging the operations cloud
When managing Kubernetes applications and environments, it’s essential to give teams the tools to quickly remediate issues. Service ownership and process automation can be critical in ensuring robust Kubernetes incident response processes. The first reduces the complexity involved in discovering and escalating Kubernetes incidents, and the second slashes the time spent on manual, repeatable workflows.
Both service ownership and process automation lend themselves to one-button, low-code/no-code solutions for Kubernetes incident diagnostics and remediation. However, to work, both need to be deployed as part of an “operations cloud” that constantly updates to reflect changes in service ownership and best practice. A Kubernetes environment is inherently subject to rapid change, so an operations cloud is essential for systematizing processes around software management and incident response.
Once implemented atop an operations cloud, service ownership and process automation can help teams triage incidents, resolve them faster, and save engineering time. As a result, organizations can guarantee reliability and consistency from their Kubernetes ecosystems and applications.