
Kintaba CEO Talks About Human-Centric Incident Response Plan


Incident response is becoming a core part of organizations’ strategies as they work to ensure business continuity after an outage or other unforeseen incident. It has become a topic of interest for both developers and C-level executives. Kintaba has created an incident management platform that lets companies of any size, from teams as small as two to companies in the thousands, respond efficiently and easily to critical incidents within the organization. “We provide the full end-to-end process from declaration to collaborative response all the way through to postmortem writeup and distribution, and even the reflection afterward, when the team comes together to look back and learn,” said Kintaba CEO and Co-founder John Egan in the latest episode of Let’s Talk. This episode focused on the newly announced human-centric heat maps, which put the focus on the most valuable asset of any company: its people. Concerns about burnout are growing as more responsibilities and roles fall to developers, and that exhaustion can hurt efficiency, performance, and, worst of all, business continuity. Egan also shared his views on why we should not put too much focus on metrics and SLAs. It was a great discussion, and I hope you will enjoy it too. Please watch the video of the show above.

Here are some of the topics that we covered:

  • Introduction of Kintaba
  • What are human-centric heatmaps for incident response?
  • Why is Kintaba introducing them now? How do they work?
  • How are organizations approaching incident response wrong?
  • Why do we need Kintaba when there is Slack?

Guest: John Egan  (LinkedIn, Twitter)
Company: Kintaba (Twitter)
Show: Let’s Talk

Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to TFiR Let’s Talk. Today we have with us, once again, John Egan, CEO and co-founder of Kintaba. John, it’s great to have you back on the show.

John Egan: Hey, great to be here.

Swapnil Bhartiya: We have been featuring you folks for a very long time, but I want to remind our viewers, what is Kintaba all about. So quickly tell us…

John Egan: Yeah, so Kintaba is an incident management platform that lets companies of any size, teams as small as two or companies in the thousands, efficiently and easily respond to critical incidents within the organization. So, we provide the full end-to-end process from declaration to collaborative response all the way through to postmortem writeup and distribution, and even the reflection afterwards, when the team comes together to look back and learn.

Swapnil Bhartiya: Excellent. And as you explained, the focus is on incident management. You folks recently announced human-centric heat maps, which is kind of shifting focus from… Not shifting exactly, but taking focus and putting it on people. So I want to understand from your perspective, first of all, what is a human-centric heat map, and what does it have to do with incident reporting, response, or monitoring?

John Egan: So, there’s a macro trend that’s happening here, right? Where incident management, that was maybe historically practiced by a couple of people inside of companies, is starting to really spread. We talked about this a little bit last time we were together, where we’re seeing more and more parts of the organization participating in this formalized process for doing response, because it makes them more resilient. It makes them happier. It gives them a better culture. And as that happens, as you start to spread to more people in the organization, from say like two or three out to like fifty out to a hundred out to a thousand. What you start to realize pretty quickly is that, you don’t just need insight into how many incidents am I having? How quickly are we closing those incidents, right? You really want to start to understand, who are the people who are involved in this process, because now we’re practicing it consistently.

John Egan: It isn’t something that just happens once in a while. It’s something that’s really impacting people’s jobs, the time people spend, their sort of happiness. And it should be elevated from a reporting standpoint, for people who are running the management process, around who are the people involved? When are they involved and what are they doing? And so, we noticed that across all of the apps that are really participating in this space, no one’s really doing that. No one’s really highlighting and elevating these people. And ultimately, the thing about incident management, right, is if it’s gotten to this point, it is a people problem, right? Your automations have been exhausted. You haven’t been able to stop whatever this critical situation is. You’re bringing responders in and it’s really important to take these people and make sure you’re aware. And so what we’ve done at a very simple level is, we’ve created heat maps of when these incidents are happening.

John Egan: And then we’ve mapped that over, to other charts that show you, who are the people, who are involved in them. So for example, you might be able to look at your incident heat maps as a customer of Kintaba, and say wow, my incidents are happening at 2:00 AM, on Wednesdays and Fridays, and Sheila is responding to 90% of these things, right? And then, Philip is writing all of the postmortems. And suddenly it’s not just about saying, how do I crack the whip and get these incidents to be closed faster, right. Suddenly it’s about, wow this is really impacting Sheila’s job. And minimally, we should probably be recognizing her at review time for putting in these hours after hours, right. And, being part of this incident response process. And beyond that, we probably ought to be looking into, what is the human impact of all of our incidents happening at 2:00 AM.

John Egan: And you can make real systemic changes, based on that knowledge. You can say, we should really stop allowing, for example, pushes to production, maybe between 1:00 AM and 5:00 AM, right? And this is what incident management is really all about. It’s all about, systemic learning and the ways that it makes your organization more resilient. And I think ignoring the people, is kind of like the worst thing you can do as you grow this process out, throughout the organization. So, it’s really that, at the end of the day, it’s a series of heat maps, giving you information about people, time and incidents and how they’re happening inside of your company.
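Editor's note: the two views Egan describes here (when incidents happen, and who responds to them) can be illustrated with a minimal Python sketch. This is not Kintaba's implementation or data model; the record layout, the sample incidents, and the function names `heat_map` and `responder_share` are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical incident records: (start time, responder name).
# The names echo the interview's example; the data is invented.
incidents = [
    (datetime(2022, 3, 2, 2, 5), "Sheila"),    # Wednesday, 2 AM
    (datetime(2022, 3, 4, 2, 40), "Sheila"),   # Friday, 2 AM
    (datetime(2022, 3, 9, 2, 15), "Sheila"),   # Wednesday, 2 AM
    (datetime(2022, 3, 11, 14, 0), "Philip"),  # Friday, 2 PM
]

def heat_map(records):
    """Count incidents per (weekday name, hour of day) cell."""
    return Counter((DAYS[ts.weekday()], ts.hour) for ts, _ in records)

def responder_share(records):
    """Return each responder's fraction of all incidents."""
    counts = Counter(person for _, person in records)
    total = len(records)
    return {person: n / total for person, n in counts.items()}

if __name__ == "__main__":
    for (day, hour), n in sorted(heat_map(incidents).items()):
        print(f"{day} {hour:02d}:00 -> {n} incident(s)")
    for person, share in responder_share(incidents).items():
        print(f"{person}: {share:.0%} of responses")
```

A grid like this is enough to surface the pattern from the interview: incidents clustering at 2:00 AM and one person carrying most of the responses.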

Swapnil Bhartiya: Why are we doing it now? And it seems that you folks are also the first ones to approach it in this manner.

John Egan: Well, I think there’s two big reasons it’s happening now. One is, this culture of incident management is starting to spread more aggressively within companies and the companies that maybe historically weren’t already practicing it, which is really exciting, right? This is kind of the initial wave that matters in terms of adopting this positive culture, throughout organizations. But I think, second to that and maybe catching up to almost being equal to that, is the pandemic has pushed us into a world where remote work is, more real than it’s ever been in the past. And the recognition of time spent by people within the organization, isn’t really something we can gauge anymore by, whose car is in the parking lot, on your way out at 5:00 PM, right? Who do you see sitting at their desk, when an emergency is happening? And so I actually think this has a lot to do, with teams that previously weren’t particularly distributed, now being more distributed and wanting that information to be better elevated, because it should be taken into account.

John Egan: And I think historically, we sort of took it into account, almost as kind of a gut reaction, emotional, anecdotal understanding, right. Managers and even C-suite executives, right. They would sort of run based on their feeling, of who’s doing what and what they can see. And we’re not in that world anymore, right. We’re in a digital first world. And in that world, where these sort of heroes come into your organization, right, and save you from these major incidents. You won’t necessarily hear everyone cheering in the office next door, or on the floor above you. You have to have some other record of that information.

John Egan: And so I saw a really great tweet the other week that said, people don’t quit managers anymore, they quit on-call rotations. And I think that’s this great element where, historically, when we’d say people quit managers, what we were really saying is, well, we’re doing a terrible job of understanding the impact of active managers on their people and attrition. And I think the tweet was getting at the same thing when it comes to on-call rotations: we do a pretty bad job of understanding the impact of being a responder. And I think elevating those things as metrics, and positive metrics, by the way, not negative metrics, but elevating them as positive things within our tools, any of our real-time work tools, just becomes more and more important, especially in the world we’re in today, where remote work is really starting to supplant in-person work.

Swapnil Bhartiya: I want to go a bit deeper into it, from a purely technological perspective. You explained why you are doing it, and why you are doing it now. But let’s talk about it from a technology perspective. Talk a bit about how these charts work and how companies can look at these metrics and actually, as you mentioned earlier, improve not only the performance but also the health of their employees.

John Egan: So I think these charts aren’t particularly technically complex, right? They’re purposefully very high-level views into the people who are participating in your incident response process. So, I talked a little bit earlier about, there’s a heat map. It’s very similar to a heat map that you’d expect to see in something like Google Analytics, where you can go and get a really good idea of who’s visiting your site, and when. Similarly, we do a really good job of boiling down when these incidents that require responders are happening, and giving you a high-level heat map of that data, which is immediately actionable, right? Like as soon as you understand when your incidents are happening, you can start to put a picture together of what’s the impact that’s happening on the people inside of your organization who are involved. Then we also have a series of charts after that, that show the people who are participating in each part of the incident response process.

John Egan: So, who’s reporting these things, who’s responding to these things and who’s writing the postmortem, because it’s not always the same person doing each of those steps. And each step is valuable, in a different way for the organization, right? So, when you’re participating in the write-ups, right, you’re really contributing heavily to the knowledge propagation, right. And systemic change inside of the organization. If you’re a responder and you’re consistently a responder, right. We know that, okay you’re someone who’s coming in. Who’s really a subject matter expert in some of these areas, that maybe you weren’t even aware of. And who’s participating really directly, related to whatever those times are, that are laid out in the heat map. So it’s not that you would look at one of these charts or two of these charts and say, all right, this chart says that Phil needs a promotion, right?

John Egan: And this chart says that, Betty needs to come in and work more hours, right? It’s more a holistic view at the top that you can get, especially as sort of a C-suite executive, a CTO, a CIO, and understand how your people and your incidents are sort of interleaving to operate day to day. And the way we’ve built our reporting system is you can then start to drill down. You can say, okay, I really want to see this within this tag set, right? Show me the people, who are participating within my infrastructure team, show me the people who are participating, within our root cause having to do with, our onboarding templates. There are all these different angles you can take. And the whole system is built to be really easy, type your tags in, type your date ranges in, get a bird’s eye view of who are the people, and when are these things happening?
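Editor's note: the drill-down Egan describes (filter by tag and date range, then see who took part in each step) might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Incident` fields, the tag names, the sample data, and the `drill_down` function are hypothetical and do not reflect Kintaba's actual reporting API.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    opened: date
    tags: frozenset
    reporter: str
    responder: str
    postmortem_author: str

def drill_down(incidents, tag, start, end):
    """Count participants per role for incidents matching a tag and date range."""
    selected = [i for i in incidents
                if tag in i.tags and start <= i.opened <= end]
    return {
        "reporter": Counter(i.reporter for i in selected),
        "responder": Counter(i.responder for i in selected),
        "postmortem_author": Counter(i.postmortem_author for i in selected),
    }

# Invented sample data for illustration only.
incidents = [
    Incident(date(2022, 3, 2), frozenset({"infrastructure"}),
             "Betty", "Cole", "Phil"),
    Incident(date(2022, 3, 9), frozenset({"infrastructure", "onboarding"}),
             "Betty", "Cole", "Phil"),
    Incident(date(2022, 3, 5), frozenset({"onboarding"}),
             "Cole", "Betty", "Betty"),
    Incident(date(2022, 4, 1), frozenset({"infrastructure"}),
             "Phil", "Betty", "Cole"),
]

if __name__ == "__main__":
    report = drill_down(incidents, "infrastructure",
                        date(2022, 3, 1), date(2022, 3, 31))
    for role, counts in report.items():
        print(role, dict(counts))
```

Keeping the three roles separate mirrors the point made above: reporting, responding, and writing the postmortem are different contributions and are often made by different people.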

John Egan: And it’s funny, it’s one of those things that, I think we mentally do all the time, right? We’re always looking at metrics and data on our technical systems and we’re then mentally kind of running a pivot table in our heads and saying, okay, what does this really mean? Right. I know that there’s a trend upwards, say of incidents, or I know there’s a trend downwards of my SEV2’s, and we’re mentally trying to do this all the time. We’re trying to say, well what does that, how does that impact my team? Who’s working? What are my rotations? When are these things happening? And it just made sense to just take that cognitive load, right. Up and off of people. And so, I think when you’re evaluating this stuff as an executive, especially, we really just want to make it so that the next level of decision making is easier for you.

John Egan: You’re not having to first understand who are the people, and what are the timings, and where are they impacted, and then try to make kind of a second-order decision off of that. We want to be able to give you that first-order information, so that you can then go into the next level, which is to say, what are the practical actions I can take now that I have this information in front of me, right? What do I really want this heat map to look like in my organization? For example, at Kintaba, right, we strive really aggressively to have our incidents happen during working hours. And that means that we have our pushes to production during working hours. We make sure that they happen early in the day, so we have lots of time for stabilization and for any customer reports we receive. And we worry if we see any of these individual charts coming back just showing one individual doing 90% of the work, if it turns out one person is doing all of the reporting, or one person is writing all of the postmortems.

John Egan: First we reward. We say, okay, that deserves recognition. And then we follow up with systemic change. We say, okay, this on-call rotation is badly timed, because it’s just working out that Cole is always the on-call responder. And even though we are happy with our heat map, we’re unhappy with that distribution of work. And this is really what incident management is all about. It’s a culture-impacting kind of process. And so, those are the sorts of decisions we would expect you to be able to make based on the information.

Swapnil Bhartiya: I talk to a lot of companies, and they love talking about reliability metrics. They like to talk about SLAs and all those things. But from what I hear about Kintaba, you folks feel that these are the things we talk a lot about, but they are not what is really important. So why? I mean, of course, after listening to you, I do have a very good understanding of why you say that, but I want to hear it from you. So please share your thoughts and insights on that.

John Egan: I think there are really two ways to think about the incident management space, and it’s really depth and breadth, right? And you have to think about, as an organization, which one you’re focusing on. And we’ve always taken the opinion at Kintaba that breadth is the most important way that you can implement incident management. It’s not just about taking your small SRE team and making them like maximally 1% more efficient, and then one more percent efficient, right? We find this to be sort of a negative feedback loop that encourages people to do things like not report incidents, or try to work the charts so that your overall MTTR is going down, right. What we really encourage organizations to say is, here’s a set of tools available to you inside of your organization to take any critical work, right? Any critical, real-time thing that’s happening inside of your company that has sort of a date-based restriction around it.

John Egan: That requires a collaborative response, right? And if you start to think about incidents that way, you realize that all these other parts of your organization are impacted, right. Your customer success organization is probably dealing with what we would call incidents in the SRE world. Your sales org is probably doing it, certainly your product engineering org, and all of these parts of your company, when you start to look at them, you realize they don’t really have that last 1% problem. They have that first 25% problem, which is they’re just hacking tools together to deal with these real time situations. And Kintaba has always taken that position. And this is another step in that direction. And I think we’ll continue to do this, where we’ll continue to build out features that you’re not going to see coming from other companies, that are working in this space, because we care so much, about whole organizations being part of this positive culture.

John Egan: And I think this is great for SRE too, right? SRE is really the vanguard. They’re the beginning, they’ve really distilled this process. And so one of the first things that happens when you install Kintaba is, you get a public dashboard of all of this information, which sounds really simple, but most organizations that are kind of operating on Excel sheets and maybe just Slack really have this information kind of tied down. The rest of the organization doesn’t even know that you have a part of the company that’s doing response, that’s doing blameless postmortems, that’s doing a really public and healthy approach to real-time critical situation response. And opening that up is really step one. And I think what you’re seeing here now from us, right, is okay, now we’re getting into step two and step three. And that’s really our charge. And that’s our dream as an organization. How do we make organizations more resilient? Not just, how do we make one or two people?

Swapnil Bhartiya: One thing that I have been wanting to ask you for a very long time is this: today when we look at teams, they are all using Slack channels. They jump on Zoom calls four or five times a day. So why is there a need for a tool like Kintaba, when there are a lot of synchronous tools already in the market?

John Egan: So, the structure of Slack really prevents what’s necessary during a major incident, right? What a major incident really needs is very defined spaces for what’s happening. Those spaces need to have metadata on them, tracking the individuals who are participating, tracking the milestones, tracking the progression, and then they need to have an end, right? They need to have a conclusion, where this thing is wrapped and closed. And then the follow-on actions that happen after that. And Slack does a great job of giving us a collaborative environment. Kintaba has a very deep integration with Slack because its actual chat interface is fantastic, right? But what Slack sort of does by default is, it encourages you to create channel spaces that are permanent, right? You’re creating engineering channels and infrastructure channels. And the minute you get to the point where you have more than one topic happening at once, you’re either getting incredibly reliant on threads, which can be pretty difficult for people, or you’re ending up with sort of a mess of information coming across a single channel that’s not traceable.

John Egan: So if you think about the world, the way we do, if you say this is a breadth problem, and your number of incidents inside of your company should be increasing, as more people are doing it. You’re going to rapidly overload, kind of the default constructs of Slack. So what you need is, something like a Kintaba to come in and start to be your orchestrator on top of Slack and say, here’s the channel, where we’re dealing with this problem right now. Here’s the metadata that Slack doesn’t natively capture. And then here’s how we’re going to make sure that channel’s closed and archived. So it doesn’t muddy up your channel list, like everything else.

John Egan: I think Slack is really fantastic for… I’d almost call it undirected work, right? You need these spaces. It’s like an office space. You come in, you can do all kinds of things in that space, but Kintaba is really great at being a war room and turning Slack into those war rooms. And so, I never really say we’re in competition with Slack, right? Slack is the office, but Kintaba is the space where you’re getting something critical done. And there’s a difference there that I think is pretty fascinating.

Swapnil Bhartiya: John, thank you so much for taking time out today to talk about an interesting, different take on incident management or incident response, one that focuses on people. So thank you for sharing those insights. And I think that’s timely given the fatigue that we all feel looking at our screens, especially in the pandemic phase. So focusing on the health of employees is also important. So thanks not only for the tool, but also for sharing those insights that, as you mentioned, C-level executives should look at. So thanks for the discussion today. And as usual, I would love to have you back on the show. Thank you.

John Egan: Great. Thank you for having me.
