Investigating incidents can be problematic and it is easy for SREs and engineers to get bogged down in the investigation process. The investigation of anomalies can quickly get expensive, and collaboration with others can be complicated. Lightstep Notebooks guides SREs and engineers through the entire analysis, from metrics through to the investigation. It is also collaborative and can be snap-shotted for future analysis.
In this episode of TFiR Newsroom, Ben Sigelman, Co-Founder and CEO of Lightstep, takes us through their recent announcement of Notebooks. On discussing the challenges of troubleshooting issues, he says, “People are getting drowned right now and are having trouble actually investigating things in a way that is collaborative and efficient. I think that notebooks and the way that we’ve implemented them go a long way towards resolving that.”
Key highlights from this video interview are:
- Lightstep is predominantly focusing on managing complex multi-service distributed systems through the life cycle of an incident. Sigelman discusses the latest product in their portfolio, Notebooks, and how it complements their second product, Lightstep Incident Response.
- Sigelman describes how it can be difficult to set up a collaborative space to investigate potential real incidents. He explains how investigating an anomaly with metrics can be expensive and does not necessarily lead to understanding the issue. He describes how Notebooks help to streamline this process.
- Observability has two main components: core monitoring, and understanding changes in that monitoring and their causes. He goes into detail about how Lightstep’s Change Intelligence combines these two elements into one tool and how it fits in with Notebooks.
- Sigelman shares the two things SREs and engineers need to better handle this complexity and gain visibility into changes in their monitoring: an intelligent ranking function and a tabular view showing the change before and after. He explains how Change Intelligence works and how it can simplify this process.
- Although incident response is often seen from a technology perspective, the business aspect also plays a role. Sigelman shares where they currently are on their roadmap and the benefits they have seen from joining ServiceNow. He discusses how they have been able to better understand business issues and the performance customers are experiencing.
The summary of the show is written by Emily Nicholls.
Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.
Swapnil Bhartiya: Hi, this is your host, Swapnil Bhartiya, and welcome to another episode of TFiR Newsroom. And today we have with us, once again, Ben Sigelman, Co-Founder and CEO of Lightstep. You folks are announcing Intelligent Notebooks, so tell us what exactly it is.
Ben Sigelman: Sure, and thanks for having me. It’s always great to be here. So Lightstep today is announcing our Notebooks functionality. I think, if I was to step back a minute, what we’re really trying to focus on here is managing complex multi-service distributed systems throughout the life cycle of an incident, right? So we recently announced Lightstep Incident Response, which is a second product in our portfolio, which allows us to have a soup to nuts approach to managing the life cycle of an incident from detection to notification, all the way through to investigation and resolution. But when we’re getting into the investigation piece, the thing that we really need to be able to accomplish for our users who are SREs and engineers who are managing these incidents, they need to have a place where they can investigate these issues collaboratively and by digging into any aspect of their application, regardless where it was or what type of telemetry it’s emitting.
And what’s unique about Lightstep’s approach to Notebooks, I think, is that we’re able to guide people through that investigation, again, through the entire flow without ever having to leave Lightstep. And they can have a guided experience from one type of data to another, and from one service to another given the technology that we’ve built into the Notebooks experience. I can dig into that in more detail, but I think people are getting drowned right now and are having trouble actually investigating things in a way that is collaborative and efficient. And Notebooks, I think, and the way that we’ve implemented them goes a long way towards resolving that.
Swapnil Bhartiya: Right. Yeah, I do want you to go a bit deeper into it so that we do understand. There is a problem area, we also know what you’re trying to solve, but how you’re doing it, that would be great.
Ben Sigelman: Yeah, sure. So, I mean, let me talk about how this often works, I think, Pre-Lightstep, right? So typically people are alerted to something by their monitoring system, which is usually built around metrics. Okay, great. For what it’s worth, I think that piece makes sense. You should be using metrics to do a lot of your core monitoring. You then jump into some kind of chart that looks like a squiggly line. There’s going to be some kind of spike in that chart. And then there’s this immediate question of how to collaborate and to investigate this with other people. Because if it’s a real incident, you’re probably going to be bringing other people in as well. Now, Notebooks are really helpful because it gives you a shared place to do that investigation. Where this gets special with Lightstep is that if you hit a wall with metrics, and you will, because the only way you can investigate an anomaly with metrics is by filtering and grouping your metrics by attributes or tags.
And that gets really expensive very quickly because the only way to do that is to add high cardinality tags and attributes to your metrics, which in turn creates a huge bill. But even if you do that, all you’ve done is turned one squiggly line into hundreds of squiggly lines. And if you try to find the one that explains the anomaly, that’s still not enough to really understand the issue. And people will then pivot over to other technologies and other systems to do observability and to investigate that change. What’s special about Lightstep’s approach to Notebooks is that we streamline that entire process like we recently… Last year, we announced Lightstep’s change intelligence functionality, which allows you to find an anomaly in metrics to just click on it, ask Lightstep to analyze that deviation, and we will find, even if we have to cross from metrics over to tracing, even if we have to cross boundaries from this service to one of its descendants or something up above it in the stack, we’ll do that analysis automatically.
And that’s also all built into our collaborative Notebooks experience. So instead of having to hit a wall with metrics, and then pivot over into other tooling, Lightstep can guide you through that entire analysis, starting with metrics, moving over into an investigation of not just the tracing data, but the service dependencies that that tracing data encodes, and the sorts of workload changes and infrastructure changes that will result in the incident in the first place. And that entire thing, that entire investigation, it’s collaborative, it’s encoded in the Notebook, and that becomes a shareable asset that’s snap-shotted for future analysis as part of a postmortem or in some incident review, or otherwise, right? So we’ve done a lot to package up what was previously several different tools and a bit of a wild goose chase as you pivot from metrics over into the tracing workload. We’ve done that all within our Notebooks functionality, which I think really is quite powerful. That context, which is very expensive, and we’ve removed it from the process.
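As a rough illustration of the cardinality point Sigelman makes above: the number of distinct time series a metric can produce is bounded by the product of the distinct values of each tag attached to it, so a single high-cardinality tag multiplies the total. The sketch below is only arithmetic; the tag names and counts are invented for the example and do not describe any particular vendor's pricing or data model.

```python
from math import prod


def series_count(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Each tag contributes a factor equal to its number of distinct values.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tag sets: names and sizes are assumptions for illustration.
base_tags = {"service": 20, "region": 5}
with_customer = {**base_tags, "customer_id": 10_000}

print(series_count(base_tags))      # 100
print(series_count(with_customer))  # 1000000
```

Adding one per-customer tag turns 100 series into as many as a million, which is the "hundreds of squiggly lines" problem at a larger scale.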
Swapnil Bhartiya: When you folks announced Change Intelligence, we sat down and we talked about it. But it’s been a while and since then it’s going to play a very critical role here. So if you can just remind our viewers what it is, how it works and once again, how Notebooks are going to leverage it.
Ben Sigelman: Yeah, absolutely. Thanks for asking. So, I mean, I’ve been working on observability stuff, I’m embarrassed to say, for half of my life at this point. I started working on this in 2003, basically and there’s a lot of talk about observability, and it’s become kind of a hot topic, but observability really has two pieces. One is core monitoring, which is very important. Observability is not just a new word for monitoring. Monitoring is just an important aspect of observability. So observability has to have really high-quality monitoring built in. And so, that’s the place where you discover that something bad has happened. So you’re monitoring things that you already know about, that you know are important, and you want to understand early if they’re going off track, right? So that’s monitoring. Then the rest of observability, and this is the part that’s really tough, is that when there is a change in your monitoring, you have to understand what caused that change.
Understanding changes is the hard part of observability and where most people’s observability strategies are falling short today. These can be reactive changes, like we’re talking about today with incident response. They can also be planned changes; CI/CD is an example of a planned change where you’re intentionally changing your system, and you want to make sure that you minimize risk and minimize the time it takes to make that change, right? But observability comes into understanding these changes. And what Lightstep has done with Change Intelligence is to take essentially two workloads, a controlled workload, which is a baseline, and a deviation workload, which is when there’s an anomaly or when you’ve made a release.
And we understand not just the core signal of how has your monitoring changed, but we look across the entire distributed system to understand what has changed in your workload. For instance, if a single customer comes in and changes their workload by a hundred X, which often does happen in B2B applications, Lightstep will detect that workload change. If it is the reason why your CPU load has spiked, your database is falling over, we’ll detect that change dynamically. And what Change Intelligence is all about is guiding our users through the many, many different candidate signals, not just within the service that they started with, but across their entire distributed system. We guide them through these different hypotheses, and we help them understand, using data, which ones are most correlated and most explain the change that they’re investigating. To bring us back to Notebooks, Notebooks will often start in some sort of incident investigation or trying to explain an anomaly. So you see something weird, you create a Notebook out of that chart and then you can go from there. And Change Intelligence is built into that workflow.
So in Lightstep, you can click on any sort of anomaly in one of those charts, and we’ll bring you into that guided analysis of the entire system. I think the thing that we’ve found is that during an incident, it’s not that you can’t find correlated changes, it’s that there are too many of them. If you’re having an incident that’s actually a major thing, that’s affecting your product, it’s likely that that’s causing issues up and down the stack, left and right in the stack. And so just merely saying, “Well, what else is changing at the same time?” is not sufficient. You need something that can actually help you understand the relationships and the cause and effect between these different pieces. And that’s really what we’ve accomplished with Change Intelligence. And so building that into the Notebooks functionality streamlines that workflow considerably, and in a way that I think is highly differentiated as well.
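To make the baseline-versus-deviation idea above concrete, here is a minimal sketch of one way to rank candidate explanations. It is not Lightstep's algorithm, which the interview does not describe; it is a simple stand-in heuristic that scores each (attribute, value) pair by how much total latency it added between the two windows. The span fields (`customer`, `region`, `latency_ms`) and the function name `rank_candidates` are hypothetical.

```python
from collections import defaultdict


def rank_candidates(baseline, deviation, attributes):
    """Rank (attribute, value) pairs by added total latency.

    Latency from the baseline window is subtracted and latency from the
    deviation window is added, so a large positive score means that value
    accounts for much more work in the deviation window.
    """
    totals = defaultdict(float)
    for sign, window in ((-1, baseline), (+1, deviation)):
        for span in window:
            for attr in attributes:
                totals[(attr, span[attr])] += sign * span["latency_ms"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented data: two customers, two regions, identical load in the baseline.
baseline = [
    {"customer": c, "region": r, "latency_ms": 100}
    for c in ("acme", "globex")
    for r in ("us", "eu")
]
# In the deviation window, "acme" sends ten extra, slower requests.
surge = [
    {"customer": "acme", "region": ("us", "eu")[i % 2], "latency_ms": 150}
    for i in range(10)
]
deviation = baseline + surge

ranked = rank_candidates(baseline, deviation, ["customer", "region"])
for (attr, value), added in ranked:
    print(f"{attr}={value}: {added:+.0f} ms")
```

In this toy data both regions also show an increase, which mirrors the point that many signals move at once during an incident; the ranking puts the customer attribute first because it accounts for the whole change.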
Swapnil Bhartiya: It seems way too intimidating and too complex. How do you really simplify it so that, once again, the teams are not getting overwhelmed by them?
Ben Sigelman: Yeah. So I think the thing that people need, they need… Excuse me. They need two things. One is some sort of intelligent ranking functions. As I was saying, there are a lot of things that go wrong at the same time, that’s normal in a distributed system so you need an intelligent ranking function. Lightstep has built that, we use… Lightstep, at its core, I think is the most sophisticated distributed tracing system that I’ve ever seen, and I’ve been working on this stuff since the early days of distributed tracing. So we use that system and that technology to rank the various things that are going on and to bring the ones that are most relevant to the top. And then the second thing you need is evidence. I think it’s actually not sufficient to tell people, “Hey, something bad just happened. You told us something bad just happened. Now, here are the top 10 potential explanations,” and just leave it at that. That’s really not that useful.
The second piece of Change Intelligence is to actually bring out a diagram of this change before and after, a tabular view showing how… Let’s say, for instance, that a single customer changed their behavior. We’ll show you how did that customer change their behavior, they went from 10 requests per second to a hundred requests per second. Between these two different time intervals, their latency increased by this much, their error rate increased by this much. And then we’ll also show you how did all the other customers behave during the same interval so we’ll allow you to see the data about how the thing that we’ve pulled out compares to everything else. And these are the questions that you would normally be asking yourself subconsciously or otherwise, and then you go on a kind of a wild goose chase to try to answer with bits and pieces of data.
In Lightstep’s case, we’ve analyzed many thousands of transactions to give you a solid statistical explanation for what’s going on, always with the option to go and dig into individual examples. So in Change Intelligence as part of Notebooks, you can go from saying, “I just got woken up,” bring some people into the Notebook, click on change intelligence, see a hypothesis about a customer changing their behavior, and see how that customer deviates from everyone else, and see specifically what they’re doing with very detailed distributed traces built on top of OpenTelemetry that tell you exactly what was going on. And that’s one unbroken flow. You never need to bring out your query editor, you never need to write custom queries, you never need to construct things from scratch.
That’s just by clicking on the thing that’s most interesting to you and getting to that guided explanation. I think most of these workflows fall short for people because they’re forced to go and write custom queries, which is really hard to do on a good day, but during an incident, impossible, or sift through lots of examples and come up with a hypothesis. Lightstep will sift through thousands of examples and come up with a data-driven hypothesis for you. And I think, that in my mind is where a lot of the time savings come from, and where we take a lot of the cognitive load out of incident management, incident response.
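The before-and-after tabular view described above can be sketched as follows. This is an illustrative assumption about what such a table might contain (request rate, mean latency, and error rate for one customer versus everyone else), not a reproduction of Lightstep's output. The record fields, the `WindowStats` type, and the function names are hypothetical; the 10 to 100 requests per second figures echo the example in the interview.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests_per_sec: float
    mean_latency_ms: float
    error_rate: float


def summarize(requests, window_seconds):
    """Aggregate one group of requests observed over one time window."""
    n = len(requests)
    if n == 0:
        return WindowStats(0.0, 0.0, 0.0)
    return WindowStats(
        requests_per_sec=n / window_seconds,
        mean_latency_ms=sum(r["latency_ms"] for r in requests) / n,
        error_rate=sum(r["error"] for r in requests) / n,
    )


def before_after_table(baseline, deviation, customer, window_seconds):
    """Return (label, before, after) rows for one customer and for the rest."""
    groups = (
        (customer, lambda r: r["customer"] == customer),
        ("everyone else", lambda r: r["customer"] != customer),
    )
    rows = []
    for label, keep in groups:
        before = summarize([r for r in baseline if keep(r)], window_seconds)
        after = summarize([r for r in deviation if keep(r)], window_seconds)
        rows.append((label, before, after))
    return rows


def make(customer, count, latency_ms, errors=0):
    """Build `count` synthetic requests, the first `errors` of which fail."""
    return [
        {"customer": customer, "latency_ms": latency_ms, "error": i < errors}
        for i in range(count)
    ]


# Invented one-second windows: "acme" goes from 10 to 100 requests per second.
baseline = make("acme", 10, 50) + make("globex", 5, 50)
deviation = make("acme", 100, 80, errors=10) + make("globex", 5, 50)

rows = before_after_table(baseline, deviation, "acme", window_seconds=1)
for label, before, after in rows:
    print(
        f"{label:>13}: {before.requests_per_sec:.0f} -> {after.requests_per_sec:.0f} req/s, "
        f"{before.mean_latency_ms:.0f} -> {after.mean_latency_ms:.0f} ms, "
        f"{before.error_rate:.0%} -> {after.error_rate:.0%} errors"
    )
```

Showing the "everyone else" row next to the suspect customer is what turns a ranked hypothesis into evidence: if the rest of the traffic is unchanged, the deviation is specific to that customer.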
Swapnil Bhartiya: I want to talk quickly about the business aspect because we talk about technology, which is good, but in the end, once again, we are trying to solve a business problem. So can you also talk about what role does it play in the business? What kind of awareness you have seen about SREs or these things which reflect to the customer success teams, or it reflects directly on the businesses? So that we do understand the importance of it.
Ben Sigelman: Yeah, that’s a really good question. I mean, about a year ago, Lightstep joined ServiceNow, and although we’ve maintained our own roadmap, and our own engineering headcount, and a lot of autonomy and growth within ServiceNow. We’ve also taken advantage of being part of their overall platform of platforms, right? So one of the things we hear a lot from customers, especially in large enterprise environments, is that you do have Cloud Native organizations who need to use a tool like Lightstep, and Lightstep Observability, and Lightstep Incident Response to manage their applications. However, that’s not the end of the story. There’s also this central nervous system of the enterprise where you’re doing major incident management, where you’re managing relationships with key customers, things like that. And that depends on ServiceNow. And what’s interesting about this is that as part of ServiceNow, what we’re discovering with these Cloud Native apps, is automatically in the background, harmonized and synchronized with that central nervous system of ServiceNow.
And that has tremendous benefits in resolving the sort of business issues attached to this. What we hear a lot is that the IT operations function that depends on ServiceNow, they have to manually create their own copies of incidents that are happening in Cloud Native applications within their own organizations. And it’s very error-prone, it’s also lossy, and pretty inefficient, and introduces not just operational inefficiency, but even compliance risk. And so something that we’re doing as part of ServiceNow, is plugging all this back into the ServiceNow platform so that it’s harmonized with those business processes. You mentioned key customers, we’re also working with customers where… For instance, ServiceNow has a customer service product, CSM, that’s used for customer-facing ticketing. Having Lightstep plugged into ServiceNow allows us to drill into key customers of our customers, right?
So an end-user, or a B2B customer, can be having an issue. Lightstep can understand what is the actual performance that customer’s having and have that proactively notify a customer service rep before it becomes a significant problem, or even just to create a shared source of truth about actual performance, reliability, et cetera. And that sort of outcome is really powerful from a business standpoint and is something that we’re able to do as part of ServiceNow’s platform of platforms. So I think there are a lot of business benefits to what we’re doing well beyond the engineer or SRE, who’s on-call using Lightstep to investigate that incident. All the work that they’re doing is harmonized to that platform and made available to other business users. And that’s really exciting to me. It’s something I’ve wanted to see for a long time.
Swapnil Bhartiya: Ben, thank you so much for taking time out today to talk about Intelligent Notebooks and, of course, your [inaudible 00:15:10] as well. And as usual, I would love to have you back on the show. Thank you.
Ben Sigelman: Hey, thank you very much.