
Demo: How Jeli Relied On Its Own Incident Management Platform

Guest: Drew Stokes (LinkedIn)
Company: Jeli (Twitter)
Show: Let’s See

Effective incident management requires going beyond technology and the contributing factors of a specific issue. It requires understanding the way an organization operates, its priorities, the way its systems are maintained, and the context of interactions between teams. Jeli helps organizations recognize the gaps in the socio-technical aspects of their operations and bakes that knowledge into its incident management platform.

Jeli itself had some incidents a few weeks ago, and the team used its own platform to dig deeper into them. It’s an excellent case study of how companies can use Jeli to learn from incidents and further improve their processes.

In a subtle way, Jeli also seems to be becoming a catalyst for cultural change within organizations, helping them improve their internal processes so they handle incidents more efficiently.

In this episode of TFiR: Let’s See, Drew Stokes, Head of Engineering at Jeli, demonstrates the capabilities of Jeli and how it can become a catalyst for aligning the tools and practices used to manage incidents within an organization.

Highlights of this video demo:

  • The staff at Jeli are all incident response and analysis experts. They get to build a tool that they use every day to respond to their own incidents.
  • Jeli has an Incident Response (IR) Bot that standardizes the process of responding to incidents, so responders don’t have to look at incident runbooks or try to remember the sequence of steps that they need to take to get an incident going.
  • Its Narrative Builder makes it easier to build those incident narratives and do the analysis. Over time, themes and takeaways can be presented in a way that helps companies make decisions about what engineering work needs to be prioritized and where there may be gaps in headcount or skill sets on particular teams.
  • Stokes showed how the incident response process works in a Slack demo environment, including how to add channels that should receive status updates, create conference bridges with Zoom, create incident tickets in your ticket management platform, specify the incident commander, and conduct incident response in a private channel if the information is sensitive.
  • He demonstrated what an actual responder might do when an incident occurs. With the press of a button, an incident channel and a Zoom bridge are established. The responder then responds in whatever way makes sense for the incident. This might include running searches, bringing other folks into the channel, assigning them specific roles, dropping into the bridge to discuss, and closing out the incident by marking it closed and indicating that it has been mitigated.
  • The live-updating view allows folks who are not part of the incident response team to see the status of the incident. In the channel, folks can respond, indicating their specific roles and what group they’re in.
  • Once the incident is done, Jeli gathers the data for analysis, including the entire Slack transcript, the on-call rotations for the folks involved, and all the communications.
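
The workflow described in the highlights above (open an incident, assign roles, close it out as mitigated, then gather the record for analysis) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the `Incident` class, its fields, and its methods are hypothetical names invented for this example and do not represent Jeli's API, its IR Bot, or its data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Incident:
    """Hypothetical in-memory incident record (not Jeli's data model)."""
    title: str
    commander: str
    private: bool = False          # sensitive incidents use a private channel
    status: str = "open"
    mitigated: bool = False
    roles: Dict[str, str] = field(default_factory=dict)
    timeline: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Opening the incident is the first timeline entry.
        self._log(f"opened by {self.commander}: {self.title}")

    def _log(self, message: str) -> None:
        # Prefix each entry with a UTC timestamp so the record can be replayed.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp} {message}")

    def assign_role(self, person: str, role: str) -> None:
        # Record who took which role during the response.
        self.roles[person] = role
        self._log(f"{person} assigned role {role}")

    def close(self, mitigated: bool) -> None:
        # Mark the incident closed and note whether it was mitigated.
        self.status = "closed"
        self.mitigated = mitigated
        self._log(f"closed (mitigated={mitigated})")

    def export(self) -> dict:
        # Gather everything recorded so it can be analyzed afterwards.
        return {
            "title": self.title,
            "commander": self.commander,
            "private": self.private,
            "status": self.status,
            "mitigated": self.mitigated,
            "roles": dict(self.roles),
            "timeline": list(self.timeline),
        }


if __name__ == "__main__":
    incident = Incident(title="Checkout errors", commander="drew")
    incident.assign_role("alex", "communications")
    incident.close(mitigated=True)
    for entry in incident.export()["timeline"]:
        print(entry)
```

Keeping a timestamped timeline alongside the roles is a design choice made for this sketch: it mirrors the idea that the record of who did what, and when, is the raw material for later analysis.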

What sets Jeli apart:

  • It is people-centric. The focus is on the incident rather than on pushing information around. It tells the story of the people who were involved, what they knew at the time they responded, what they didn’t know, and how these specific types of events affect the teams and the broader organization.
  • It is customizable. Every organization’s culture is different in terms of the way they think about and respond to incidents, and the way they analyze and incorporate what they learned into their organizational processes.
  • It helps with education. Looking at past incident data helps with coordinating the response as well as deepening the understanding of complex systems over time.
  • It helps tell stories. When an incident occurs, there are folks who are experts in using a piece of technology and understand it deeply. And there are other folks who need to understand it, but not at the same depth. Telling the story about what’s happening can be challenging. It requires data and it requires finesse. Jeli is an effective storytelling tool, especially for SREs and folks in the incident space to explain what they know to groups within the organization that don’t have the same expertise.
  • Case in point: One of Jeli’s customers had a piece of technology that had been causing them problems for months. A frustrated senior member of the org wanted to get rid of it and replace it with something else. The folks who responded to that incident were able to convey and substantiate the fact that the problem was not the technology itself, but the way it was implemented. Instead of replacing the technology, the organization needed to invest time in getting it properly configured and having the team utilize best practices.

This summary was written by Camille Gregory.