Guest: John Egan (LinkedIn, Twitter)
Company: Kintaba (Twitter)
Show: Let’s Talk
Keyword: Incident Management
Kintaba, the incident response and management company, is gearing up for IRConf, a conference dedicated to incident response. John Egan, CEO and Co-Founder of Kintaba, found that incident response tends to feature as one part of other, more generic industry conferences rather than being the sole focus. The conference aims to bring together industry experts from companies like Google, AWS and Netflix, as well as new voices, for a half-day virtual event on April 1.
One of the challenges of incident management and response is that most incidents originate with a customer reporting a problem, but the report rarely goes directly to an engineering or product team. The first point of contact is usually someone in customer success or customer support, and the report is then passed on to other departments to look into. Kintaba aims to give customer success and customer support personnel the ability to declare incidents themselves.
Kintaba’s platform aims to provide companies with the tools they need to respond to incidents straight away. It brings together the right on-call people to deal with the incident, integrates with other tools in the stack, and tracks the mitigation and resolution status. Whereas traditional tools focus on a couple of formal incident responders within the organization, Kintaba aims to create a holistic approach to incident management. The platform aims to be accessible to the whole organization, with a full collaboration chat experience, both in the app and also bidirectionally connected to Slack.
Lowering the communication barriers to let the entire company participate is something that Egan feels is particularly important when dealing with incidents. Traditionally, only a small portion of the company has the full picture of an incident, while the rest of the company gets fed information on a need-to-know basis, often cascaded down through layers of management, which can be frustrating. However, giving everyone access to the source data and managing their role and participation without having to send numerous emails helps lower the communication barriers and make incident response more accessible.
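As a rough illustration of the workflow described above, the sketch below defines on-call roles per department ahead of time, lets anyone declare an incident, and pulls in the relevant on-calls while tracking status. It is not Kintaba's API or data model: the roster, the names in it, the `Incident` class, and the `declare_incident` function are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"


# Hypothetical roster: one on-call per department, defined before any incident.
ON_CALL = {
    "engineering": "alice",
    "customer_success": "bob",
    "pr": "carol",
    "legal": "dave",
}


@dataclass
class Incident:
    title: str
    declared_by: str
    responders: dict = field(default_factory=dict)  # department -> on-call person
    status: Status = Status.DECLARED


def declare_incident(title, declared_by, departments, roster=ON_CALL):
    """Declare an incident and look up the on-call for each department.

    Anyone may declare (the "big red button" idea); the declarer does not
    need to know who the responders are, only which departments are needed.
    """
    missing = [d for d in departments if d not in roster]
    if missing:
        raise KeyError(f"no on-call defined for: {', '.join(missing)}")
    responders = {d: roster[d] for d in departments}
    return Incident(title=title, declared_by=declared_by, responders=responders)


# A customer success on-call declares an incident from a support ticket.
incident = declare_incident(
    "Checkout failing for key customer", "bob", ["engineering", "pr"]
)
incident.status = Status.MITIGATED
```

The point of the sketch is the ordering: the roster exists before the emergency, so the person declaring only has to name the departments involved.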
About John Egan: John Egan is CEO and cofounder at Kintaba, the modern incident response and management product for teams. Prior to Kintaba, John helped to lead enterprise products at Facebook.
About Kintaba: Kintaba is a modern incident management platform built by former Facebook engineers. It lets companies and teams implement best-practice incident response processes without the overhead, creating a seamless workflow for more effectively responding to major outages.
The summary of the show is written by Emily Nicholls.
Here is the full unedited transcript of the show:
- Swapnil Bhartiya: Hi, this is Swapnil Bhartiya, and welcome to another episode of TFiR Let’s Talk, and today we have with us once again, John Egan, CEO and co-founder at Kintaba. John, it’s good to have you on the show.
John Egan: It’s good to be here.
- Swapnil Bhartiya: Kintaba is organizing IRConf or Incident Response Conference, which is, if I’m not wrong, the first conference dedicated to incident response. I’m curious that, why do you think that this is the first conference? Do you think that we have never done a conference like this before? And if not, why? Because incident response seems to be a topic which is of interest to a lot of people.
John Egan: Yeah. The genesis of this is very much just like you say, we expected there to already be a conference out there for incident management of some sort, and when we were looking for one to participate in, we found SREcon and DevOpsDays, all these other sort of more generic industry conferences that tend to have pieces about incident response. But as this industry moves more and more towards kind of a wider audience, a more interesting community, a multidisciplinary community across people inside of companies, it got more and more interesting to us to say, “Well, maybe we could just put an entire agenda together that’s really focused on incident management, incident response across organizations.”
And it really just took off from there. We ended up putting together a really fantastic list of speakers who were excited. And I think it’s starting to take off, which is pretty exciting when you build a conference. It’s always a bit nervous when you come out and want to bring a new community together, but I really have the sense that this community needs a place to come together. And IRConf is really meant to be that. It’s very purposefully free, easy to access and virtual so that anyone can attend.
- Swapnil Bhartiya: Can you talk about what are going to be some of the key topics or ideas of focus of this conference?
John Egan: Yeah, so we’ve got Emily Freeman as a keynote in this, who’s the author of DevOps for Dummies, really fantastic speaker. She gave an oversubscribed speech at AWS re:Invent this year that was pretty fascinating, around the software development life cycle. And here she’s focusing more on, like, revolutionizing incident response, right? These big changes that we’re really seeing in the industry around who is participating, and when do we participate in incident response, which is pretty exciting. We’ve also got J. Paul Reed, who’s a senior applied resilience engineer at Netflix. We’ve got Dave Rensin, who previously was an SVP at Google, who handled customer reliability engineering, which is a really cool topic. And then we’ve even got people like Christine Yen from Honeycomb, and Pedro Canahuati from 1Password.
Like what’s so cool about this is that the topics that are going to be covered are from all sorts of different organizations in tech. And primarily, folks who are sort of up near the top, talking about the cultural impacts as well as the practical impacts of how do we really do a good job of being stewards of incident management within our companies as this space starts to take more shape?
So the topics are everything from, how do customers play in, right? How is incident management changing as it spreads across the organization? How do we do incidents at scale, right? How do we rethink on-call setup? There are these really kind of philosophy-challenging talks, where we might take the way traditional organizations deal with things today, where incidents, as we’ve talked about before, you and I, tend to be dealt with as panic, right? And instead, what are the actual practices that you can apply to this space and start to be a good practitioner? So, those are some of the topics being talked about and some of the speakers.
- Swapnil Bhartiya: How does Kintaba approach incident management, which is kind of different from more traditional approach, and why?
John Egan: So Kintaba takes a very holistic approach to incident management, where we believe really strongly that the entire organization can participate in this process. You’ll see a lot of traditional tools out there really focus on a couple of people who are formal incident responders in the organization. Five or six folks, maybe inside of SRE, maybe inside of the engineering organization. And Kintaba really takes this attitude that the incident management process should be accessible to the whole organization. So everything about the product is really accessible, easy to use. We have a full collaboration chat experience, both in the app, as well as bidirectionally connected to Slack.
The dashboard experience is really easy to access. Everything is point and click UI. And we really try to steer people away from structured data, right? We don’t want to feel like this IT tool that you have to come into and fill out like 300 data points so that a report can be built. We want to give you the operational tools to be the on-call responder for legal, the on-call responder for PR, as well as the on-call responder in customer success and engineering, and bring all of that together. So, I think we’re really the first company to take that holistic approach to say, “This is a company-wide action.” And I really think IRConf reflects that, when you look at the kinds of things that people are talking about, this is the shift. Sometimes we call it the shift left, right, of the industry. How do we pull this industry more and more towards the customer, which when you’re pulling out of like SRE and engineering means more and more towards the entirety of the organization from an operational standpoint.
- Swapnil Bhartiya: When you mentioned getting the whole company involved in incident management, the interesting fact is that in today’s world we have kind of moved a bit from silos, but we still have soft silos like DevOps [observatories 00:05:42]. The different teams, different specialization. But the interesting thing is that when something breaks, when a company gets tagged, or hacked or compromised, that’s when the whole company comes together. That’s when people see each other’s faces, “Oh, you actually work there, you do work there,” because everybody works in their own areas. So, which is more of like reactionary, you react to something. But here, what you’re saying is that, to bring a cultural change for the whole company to work together. So can you talk about, first of all, realistically, can you get the whole company involved in that?
John Egan: So, your first point, 100%, nothing knocks red tape down like an emergency, right? Everything has process. We saw this with governments during the coronavirus pandemic, right? All of a sudden, we needed to lower that barrier for communication across parts of the government, into the medical industry, and across entire nations, because we had an emergency going on. And I think companies are discovering the same thing, and we’re already doing this. So, the mistake to make would be to say that it’s a change that the whole company is participating. The reality is, the whole company already participates in incidents. We simply, traditionally, only give the tooling to a very small portion of the company, right? And everyone else has to reach in through these really annoying communication barriers, right? I talk to my manager who talks to your manager, who talks to someone else’s manager, who asks a friend who knows where the channel is, or the incident’s being dealt with. And finally we get back, “Okay, it’s going to take another hour.” And then that works its way back up and across the institution.
And what good incident management is all about, is about taking that administrative overhead and just removing it, right? Giving everyone access to the source data, and managing your role and your participation without having to send emails and having to like make communication interrupts. And so, I think that’s happening anyway. And I think tooling like Slack has really moved entire organizations to have more access internally. You don’t have to know someone’s email address, you don’t have to be friends with them ahead of time, you can look them up. And incident management is really that layer, Kintaba especially is that layer that helps to break down those barriers. And the way we really do that practically, and the first thing we do, is we provide a role management system, where you can assign out these roles really easily before the incident happens, right?
So what you want to do before an emergency strikes is know, who’s my on-call in PR? Who’s my on-call in legal? Who’s my on-call in customer success? And a lot of organizations traditionally, only think about on-calls from an engineering standpoint, right? People we need to call when a metric goes in the wrong direction, and we know the owner of that system. But there really are on-calls in other parts of the world as well. You probably need a sales on-call. And so the real, practical effort of that to me, is get those roles defined, and then lean on Kintaba to help bring the right people together. So now when you’re responding in SRE, it doesn’t matter if you don’t know all of the people in PR. You know who the on-call is, and you can bring them in, and that person can start being effective immediately. And the why’s, I think are baked into that, right? The why’s are all about how do we make sure that this kind of a process is predefined in the system? Here’s how we declare, here’s where we go. Here’s where we look to see where the problems are. Here’s where we record our learnings and do our charting.
Like just knowing the one place that’s going to happen is a huge win for companies that historically were doing their response in Slack, writing up their postmortems in Google Docs, putting all of the structured data into Google Sheets and then throwing it all into Excel, and trying to get a report out of it, right? The why is just about, take away the pain, and when you take away the painful part, incidents stop being scary. And you recognize that they’re really just part of your business, and they’re part of everyone’s business.
And again, just to keep mapping back to the conference, what’s so exciting about this conference is, all of these companies and the people who are coming in, these are well established, successful organizations, right? This is AWS and Netflix and startups and Google and Honeycomb. These are companies that are good at what they do, and they’re on fire every day, just like every other SMB out there and rapidly growing and large company. And I think that’s the piece that glues all of this together. It’s not culturally understood at a lot of companies that these problems are to be expected. It’s thought of as let’s get rid of them, right? Our goal is incident zero, right? And that’s an impossible goal. And it’s a goal that you really should never have. And instead embracing the idea we’re going to have them and let’s plan for them. That’s how you fix the problem.
- Swapnil Bhartiya: In the end, it doesn’t matter what kind of company we are running. In the end what matters is, how happy our customers are, right? Their success is in respect of, what we do there? What role do you think incident management plays in the success of a customer? Can it be a powerful tool? A useful tool to address customer issues also? If yes, how?
John Egan: Yes, definitely. I think most incidents inside of organizations actually originate with the customer. We like to think about incidents as something that are marked by metrics that we’ve predefined, right? We’re already monitoring our egress, we’re monitoring our database stability, we’re monitoring our capacities. And we predict that when things go badly, those metrics will help us to trigger incidents. The reality, and I can say this after seeing thousands of incidents come through in Kintaba, talking to our customers, working at companies like Facebook. The reality is most incidents actually come from a human. They come from a customer reporting something that you’re not tracking, right? If you were tracking and good at this, you would’ve caught it before it became an incident, right? There’d be an automation, there’d be a shell script that would’ve auto scaled your capacity or whatever. But what happens is, at the most unexpected moment, when all of your charts are green, you get a support ticket in from your most important customer who, if you’re small, makes up to 30 plus percent of your business, telling you that something isn’t working. And all of a sudden, it’s all hands on deck.
And the organizational unit that’s usually responsible for that first interaction is very rarely engineering and product, right? It normally is customer success, customer support, and so it’s really important I think to let products like Kintaba get you closer and closer to those parts of the organization, and enable them to declare incidents. Not to ask for someone to look into it and then take an action later on down the road. We’ve talked about this concept before of the big red button, right?
It’s like on a factory floor, there’s this button and anyone can push it, and it’s there because the determination of a safety risk on a factory floor is something that we trust everyone on that floor to be able to do. And if they’re wrong, it’s okay. It’s okay to have a couple of false positives in the name of safety, on a factory floor, at a company. Similarly, it’s really important to provide tools to state that something is happening that’s an emergency, and give that ability to parts of the organization closest to the customer, where we see in the data, most of the incident knowledge actually originating, or at least the symptoms of the incident originating.
- Swapnil Bhartiya: John, thank you so much for taking time out today to talk about, of course, IR Conference, and also the role and importance of incident management in this world. Thanks for those insights, and as usual, I’d love to have you back on the show.
John Egan: Thank you. Great, thank you. And I’d like to send anyone who wants to attend Incident Response Conference, it’s irconf.io.