Cole Potrocky, CTO and co-founder of Kintaba, talks with Swapnil Bhartiya about incident management.
Kintaba is dedicated to modern incident management, enabling companies to better respond to major incidents and outages. The company got into this space because they considered it a learning problem.
According to Potrocky, “We figured failure is sort of a constant theme, whether you’re a one-person company or whether you’re a thousand-person company. You learn through getting to the periphery of what failure is, and then you reflect on that failure.”
Kintaba sees incidents as “black swans,” which are unpredictable events. One major “black swan” Potrocky points out is the COVID pandemic, which could not be predicted. These types of events require businesses to solve problems in completely novel ways. Practices such as chaos engineering can still help companies learn about their systems and where their weaknesses lie.
Potrocky worked on the task management system at Facebook for several years. He adds, “We ran into this point where a huge percentage, probably the majority of tasks being created at Facebook, were from automated systems and people would either ignore them or they would get closed out immediately.” This led Potrocky to believe there should always be a human operator who can declare such an incident as a priority. Your business can still employ automation, but the human factor is still important.
With this in mind, Kintaba has turned to Slack, which is a great tool to help businesses “deformalize” the process. According to Potrocky, using Slack makes it possible to very quickly create an incident and pull in whoever you want with the Slack invite flow. Kintaba is really focused on what’s actually going wrong, instead of “How did this happen?” and “How can we categorize this?” This deformalizing of the incident process has led Kintaba to build an app, called Decider, that allows you to create a breakout room in Slack, which is a public channel so people can look at your decisions after the fact. With this breakout channel, Decider allows you to draw conclusions from each decision. This also makes it possible for anyone to view how a company has made its decisions, which (according to Potrocky) is “super vital because that is where creativity happens. It’s having that information available and allowing people, external to you or your team, to look at it because they might have an outsider’s insight that helps you make better decisions in the future.”
Decider is available for free in the Slack App Directory.
Summary for this interview/discussion was written by Jack Wallen
Here is the edited transcript of the interview.
Swapnil Bhartiya: Welcome to TFiR Let’s Talk. I’m your host Swapnil Bhartiya. And my next guest today is Cole Potrocky, Co-Founder and CTO of Kintaba. Cole, it’s great to have you on the show.
Cole Potrocky: Yeah. Great to be here.
Swapnil Bhartiya: We have spoken with Kintaba before, and today we want to focus on incident management and how Kintaba stands out. Before we start this discussion, could you tell me about the importance of incident management? Why should companies care about it?
Cole Potrocky: Yeah, so we actually got into the incident management space because we considered it to be a learning problem. We figured out how companies grow and how they become resilient. And we figured failure is sort of a constant theme, whether you’re a one-person company or whether you’re a thousand-person company. You learn through getting to the periphery of what failure is, and then you reflect on that failure, and then you grow stronger through the insights of failure, but you must allow yourself to actually get there and you have to be honest. And that’s what we think the incident management approach is. We think it’s primarily a human problem where you bring people in who know how to fix the problem, you fix the problem, and then you reflect on it. We think the reflection process is actually incredibly important, where you ask yourself what went wrong and why it went wrong. And then of course, the most important part of that post-mortem process is how you ensure what went wrong doesn’t go wrong the same way again. It’s not to eliminate all failure. It’s not to avoid incidents or problems.
Swapnil Bhartiya: Is incident management more about, “Hey, something went wrong, let’s reflect on it and let’s try to make sure that it will not happen again”? Or is it more like chaos engineering, where you go in and prep things so that you can prevent such incidents from happening? And how do you also automate things so that you take the human element out of it?
Cole Potrocky: We often see incidents as “black swans.” There are things you can’t predict, right? Like COVID would have been the ultimate “black swan.” You couldn’t really have predicted that. And so when those types of things happen, you’re solving problems in completely novel ways. And there is a lot of value in going through processes like chaos engineering, because they help you learn about your system and get educated about your system and what its weaknesses are, how it can fail over. But I worked on the task management system at Facebook for a number of years. And we ran into this point where a huge percentage, probably the majority of tasks being created at Facebook, were from automated systems and people would either ignore them or they would get closed out immediately.
And we ran into this issue where, when people see incidents or tasks that are cut from automated systems, they ignore them. They don’t take them as high priority and high value. So we think there should always be a human operator who is in between, like, a power outage and actually declaring an incident and waking up all your coworkers. And you can still have that automation, but there should be a human always involved in the process, always confirming things, because I just don’t think anything in AI is there enough to come in, diagnose the problem, and solve the problem with a truly novel incident.
Swapnil Bhartiya: I’d like to go back to your earlier point about why you built this and what you saw in that space. I’d also like to talk a bit about your experience at Facebook and how it helped you in building the company and the solution.
Cole Potrocky: The whole thing with Facebook was that the automated approach didn’t work. And there’s such a volume of things like to-dos and tasks that people really didn’t care about. And incidents at Facebook were always run really well. We had this internal tool called SEV manager. When an incident was declared, it was always high priority, but that was largely because Facebook didn’t link their internal automated systems with SEV manager. I recall, if there were an egress drop and they saw there was like 500 gigabits per second of bandwidth that wasn’t being served anymore, those wouldn’t be automated systems cutting incidents. A person, like some engineer, would have been watching the graphs, seen this huge drop, and been like, “Oh crap, we need to cut an incident for this,” because often, you know, your machinery goes wrong too, right?
Occasionally you might see, “Oh, this might be an egress drop,” but it’s your process for determining and calculating egress that is off. And you do not want to get in the way of engineers who are working, who have their protected computational time, because that destroys creativity. It destroys people actually getting their work done outside of incidents. You want to constrain your incident space to be a really small space so that you get incidents diagnosed and fixed really quickly. And then your engineers can get back to work on tasks that they deem relevant and important, and they have control over their schedules as much as is truly possible.
Swapnil Bhartiya: Now that we have SREs, from an incident management perspective, do you think that it is becoming an organization-wide solution or approach, or is it still kind of limited to SRE teams? What are you seeing there in reality and what would be the ideal way?
Cole Potrocky: So we were hoping to really expand outside of the SRE space, but we’re really seeing SRE teams continue to drive these incident responses. One big component of why I think that’s the case is because incident management, the traditional incident management approach, is a lot of process. It requires a lot of effort to do right. It requires a lot of formalism. It requires you to really psychologically take note of how you failed, which is difficult as an individual, and as an organization it’s even harder. So what we are really hoping, and we think this can still happen, is that everyone can take part in incident management, but that requires a huge deformalization of the incident management space to try to reduce the incident management response down to its most important constituent parts. We think that the post-mortem is probably one of the most important parts of that, which is just writing your conclusion.
You can just say, it’s the conclusion of the decision we made, or one thing we could improve. And we’re big believers in tiny habits. A five-second post-mortem would do just one thing: make it easy, make it fun, make it light. And that’s part of the incident management approach. Like, no company is going to adopt this approach of failure or the difficulty of looking at yourself clearly, because no one adopts a habit unless it’s fun. I always say, I have never met anyone who does burpees on a regular basis, because they’re painful and terrible, even the people I know who are in great shape. All things that you do regularly must have some element of fun and interest, and incident management hasn’t historically had that because it’s been like this really formal tool.
Swapnil Bhartiya: Can you talk a bit about what your solution looks like? Do you offer something that people can install on-prem or on their cloud, or is it a SaaS service?
Cole Potrocky: I can talk to that: this is a SaaS. However, we do have the ability to run on-prem if people need it. I know everyone today is really worried about their data security and we take that incredibly seriously. Often enough, that means running out of your own data centers that you trust; often enough today, that means you’re running on AWS, Google Cloud, or Azure. So we’re also thinking about exploring other solutions, like allowing people to use our SaaS infrastructure without necessarily having to trust us fully, bring-your-own-key type solutions. So we really understand that, especially because incidents are places where you talk about a lot of the ugly parts of running a company, and that failure necessarily is something you want to keep secure. And we take that super, super seriously.
Swapnil Bhartiya: Now, you said the solution is a SaaS, but people can run it on-prem as well. There are other incident management tools as well. So what unique value does Kintaba bring to the market, or what features or advantages do you have that mean people should prefer your solution?
Cole Potrocky: I mean, truly, I think we’re bringing that human-centric idea of building tooling. We bring that, but I think a lot of our competitors are thinking about automating everything. Like, we do believe in automations; Kintaba has a great automation suite. But we believe in automating once you find failure states. Using the meta concept of failure and incident management and applying it to your incident management tool, we recommend teams come in here and just try things without any automations, without any of the bells and whistles. They see how things work and how things don’t work. They figure out how their incident management process can be automated and improved, and they find rules. Automation rules really help them with that.
So while our competitors think, “Hey, we’re going to have machines do the whole job for you,” we think that machines should always help you do the job, but you should always be the person in control. You should always be the person doing the job, clicking the big red button. And our approach really shines there. Like, it’s really built for people to use. It’s built around a lot of the UX principles we knew about at Facebook: really help people find information that’s relevant and keep everything else in the background.
Swapnil Bhartiya: You used the phrase “human-centric.” How do you allow these teams, whether they are still SRE silos or spread across organizations, to collaborate better with each other in real time? Sometimes you have to react fast. So, talk about what initiatives you have there.
Cole Potrocky: Yeah. So we have a really great Slack-first experience, and again, we’re trying to deformalize things, and Slack is a really great tool to help you deformalize processes. Like, you can just type “/kintaba new” and create an incident very, very quickly. And you can just pull in whomever you want just using Slack’s normal invite flow and just start figuring things out there. One thing we really don’t recommend is trying all sorts of metadata and categorization when you’re solving an incident. And we’ve seen this in our competitors, where they’re so worried about the organization, and I’ve constantly been droning on and on with this phrase, “Don’t mistake the appearance of order for the order itself.” So we’re really big on focusing on what’s actually going wrong, not getting involved too much in “How did this happen?” or “How can we categorize this?”
How can this fit into our mold of where we think this incident should be filed afterward? We want you to actually do what you’re supposed to do during an incident, which is going to be messy. It’s going to be a little difficult and it’s going to be sort of hard, and you don’t want to have to deal with all these other things going on along the way. You want to find your people who are fixing the problem and you want to protect their time so that they can fix that problem quickly and coolly, because people always do better when they’re feeling relaxed.
Swapnil Bhartiya: I’m aware that a new Slack app is coming out from Kintaba. Tell me more about it.
Cole Potrocky: Yeah. So in that theme of deformalizing the incident process, we’ve started breaking apart every incident management process into its constituent parts. So we built this app that is very simple. It allows you to create a breakout room on Slack, a public channel so that people can look at your decisions afterward, and you can invite people. And we only invite people who are online so we can protect people’s time if they’re not online or they’re busy doing something else. And you sort of hide how the sausage is made. You talk about the decision you want to make. Maybe it’s because we have these marketing materials that we have to get out ASAP, and we have to make a decision on which versions we’re going to go with. Maybe it’s something more like incident management, where it says, “Hey, we have to figure out this problem. Let’s do this breakout channel.” And then Decider allows you to draw conclusions from each decision before you close out the channel, which are the top-level decisions we made.
What it does is it starts breaking down silos by getting rid of some of the reasons, the bad reasons, that people prefer silos. One of them in Slack is, “I don’t want these conversations muddying up all of the other conversations taking place.” So with Decider, you can just create a breakout room, figure out what you want to talk about, and have the top-level conclusions for anyone who doesn’t care about how you got to those conclusions.
And of course, the important part is anyone can go in and see how you made your decisions. And that’s super vital because that is where creativity happens. It’s having information available and allowing people external to you or your team to look at it because they might have that outsider’s insight that helps you make better decisions in the future. And of course you can always search through Slack, maybe find some problem you’re dealing with today that isn’t actually so novel after all. You can just go and find it and it’s public and there you go.
Swapnil Bhartiya: Is it separate from your offering or is it integrated?
Cole Potrocky: So Decider is totally separate. It’s a very, very simple Slack app. We’re giving it away for free. And it’s really about just getting people more into better remote work solutions and dealing with those time-bound, high-priority tasks that often are incidents. Because again, we’re big believers in tiny habits. We think that you start small as a company, as individuals, and then you get bigger and bigger. And then the idea is you prepare people for incident management, you create a process people enjoy, and then maybe when things are going really terribly, you’re not so down about things because you’re used to it and you ramped up to it, versus just diving in deep and running into all the problems that entails.
Swapnil Bhartiya: Does that documentation live inside of Slack, or is it also available outside of it so others can benefit from it?
Cole Potrocky: It’s totally tied to Slack. Once again, incidents are company documentation at the end of the day. And I think that’s what you’re sort of alluding to, that you do want to extract the insights of decisions so that other people can look at them. Often enough, though, again, it’s around that informalization process. Large companies have wikis and documentation about previous incidents or processes, but I have almost never seen a company that has really up-to-date internal docs. They’re usually just immediately out of date. And often enough, if you have all these decisions in Slack, you can just sort by time and you can just refer back and you can say, well, the decisions that were more recent are more likely to be correct, and the decisions that are less recent are more likely to be less correct.
And it really creates this idea of having a little bit of discernment around the knowledge you’re reading, and a wiki doesn’t give you that; it’s this definitive source. But as we know today, there are no definitive sources. You have to be able and willing to read for yourself and to understand, and there are no shortcuts to understanding. You have to just pay attention and see what you can learn.
Swapnil Bhartiya: Cole, thank you so much for taking time out today to talk about Kintaba, incident management, and the new free tool that you are releasing to users. And I would love to have you back on the show. Thank you.
Cole Potrocky: Yeah. Thank you.