
Automating Incident Response: How AI Helps SREs Reduce Toil and Complexity


Incident response is a critical, often stressful, part of modern software operations. Getting paged is just the beginning; orchestrating the response, learning from the event, and preventing recurrence is where the real challenge lies. At KubeCon + CloudNativeCon Europe in London, Swapnil Bhartiya spoke with JJ Tang, Co-Founder and CEO of Rootly, to understand how his company is tackling this challenge.

Rootly offers an on-call incident management platform that aims to streamline this entire process and is used by prominent companies including Nvidia, LinkedIn, and Dropbox. But how did Rootly come about, and what makes its approach different?



Incidents Happen. Rootly Happens Next.

Like many successful startups, Rootly was born from direct experience with a problem. Tang previously worked at Instacart, the grocery delivery giant. While the company had basic on-call paging tools, Tang and his co-founder, Instacart’s first site reliability engineer (SRE), found a significant gap after the initial alert. “What I quickly realized was that after you got paged, there wasn’t a great way to organize and orchestrate the incident,” Tang explained.

This frustration led them to build an internal tool to manage incidents more effectively. As it gained traction within the company, they realized that “everyone else in the world also has incidents.” Motivated by this insight, they left their jobs to launch Rootly as a commercial offering.

Since its founding roughly five years ago, Rootly has seen significant adoption. Tang notes that incidents aren’t going away; in fact, they’re arguably becoming more complex. He attributes some of this increased complexity to the rise of AI coding assistants like GitHub Copilot. “Sometimes people don’t understand the code they’re pushing or the dependencies that are being created,” Tang observed, adding that this complexity has fueled Rootly’s adoption. The platform is industry-agnostic, helping companies from Replit to Figma manage their incidents.

Companies turn to Rootly to bring consistency and maturity to their incident response processes. Tang highlighted Dropbox as an example, where the goal was to standardize incident reporting and post-mortem writing, and to ensure follow-through on action items so that teams learn from each incident. “That’s a very popular use case,” Tang said. “You want to be able to go from the alert … all the way to the post-mortem, and tie this feedback loop closer and closer so you can get more preventive in the future.”

Rootly’s Differentiated Approach: Beyond Alert Fatigue

Alert fatigue is a well-known problem in operations. While deterministic methods exist to control alert volume, Rootly is pushing further using AI (artificial intelligence). Tang described developing more “agentic” capabilities to automatically correlate alerts, distinguish noise from signal, and determine whether action is needed.
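For readers unfamiliar with what a deterministic method for controlling alert volume can look like, here is a minimal sketch of one common approach: suppressing repeats of the same alert inside a fixed time window. It is an illustration only and does not describe Rootly's product or implementation. The `Alert` fields, the `suppress_repeats` function, and the five-minute default window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # hypothetical field: service that emitted the alert
    check: str        # hypothetical field: name of the failing check
    timestamp: float  # seconds since some reference point


def suppress_repeats(alerts, window_seconds=300.0):
    """Keep the first alert per (service, check); drop repeats inside the window."""
    last_kept = {}  # (service, check) -> timestamp of the last alert that was kept
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        # Keep the alert if this key is new or the window has elapsed.
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


alerts = [
    Alert("api", "latency", 0.0),
    Alert("api", "latency", 60.0),   # repeat inside the window, suppressed
    Alert("db", "disk", 30.0),
    Alert("api", "latency", 400.0),  # window elapsed, kept again
]
paged = suppress_repeats(alerts)
```

A rule like this is predictable and easy to audit, but it only matches alerts that share an exact key. Correlating differently named alerts that stem from one underlying cause is the kind of problem the AI-based approaches Tang describes are meant to address.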

The most exciting development is Rootly’s “agentic SRE,” designed to autonomously assist with troubleshooting. “Imagine a P3 alert. Usually, that would require the time of a very expensive SRE to determine what happened,” Tang explained. “We can suggest the fix for you automatically, which significantly reduces time and fatigue.”

However, Rootly recognizes that tooling is only part of the solution. Tang emphasized the interplay of people, process, and tooling, stating, “I think tooling—even ours—is a minority part of that.” Consequently, Rootly also advises clients on improving their overall incident response processes.

Product Offering: SaaS and AI

Rootly’s core offering is a SaaS (Software as a Service) platform comprising several key components:

  • An on-call platform with a mobile app.
  • An incident response product facilitating collaboration within tools like Slack or Microsoft Teams.
  • The aforementioned “agentic SRE” for automated debugging and troubleshooting.

The Future: AI Labs and Innovation

Looking ahead, Rootly is heavily invested in exploring how large language models (LLMs) can further transform incident management. Recognizing the challenge of helping large enterprises adopt AI safely and productively, Rootly has launched its own AI Labs. This R&D initiative brings together technical experts (including former AI and platform leaders from Twilio and Venmo) to build open source projects and push innovation in the space. “It’s one thing to build with AI; it’s another to innovate with AI,” Tang stated. “And I think if you really want to be a company on the bleeding edge, you must do both.”

By combining a comprehensive platform with a forward-looking approach to AI and process improvement, Rootly aims to help organizations move beyond reactive firefighting towards more proactive, efficient, and less fatiguing incident management.

Guest: JJ Tang
Company: Rootly
Show: KubeStruck
