
Lightstep Extends Beyond Observability With Lightstep Incident Response


Guests: Ben Sigelman (LinkedIn, Twitter)
RJ Jainendra (LinkedIn, Twitter)
Companies: Lightstep (Twitter) | ServiceNow (Twitter)
Show: Let’s Talk
Keywords: Observability, Incident Response

Lightstep recently launched Lightstep Incident Response, which aims to seamlessly bring together observability and incident response. This is the first product the company has launched since its acquisition by ServiceNow. Ben Sigelman, CEO and Co-Founder of Lightstep, and RJ Jainendra, VP of Emerging Business at ServiceNow, joined us to talk about this announcement, the evolution of Lightstep beyond observability, and more.

When asked why an observability player, Lightstep, is entering the incident response market, Sigelman said, “The idea that the right solution for minimizing downtime would be to have two separate products, one for observability, and actually two separate vendors, even for observability and incident response. It just doesn’t make that much sense.  The most obvious thing we could do is to integrate those into one platform and offer them under our brand.”

The SaaS product aims to make it easier for developers to manage incident response, bringing together the right people across different teams and services with collaboration tools such as Slack and Zoom integrated so that the response can be dealt with quickly and efficiently.

The company was keen to make the incident response solution self-service, allowing developers and SREs to sign up easily and get access to the product straight away. Customers can quickly set up alerts from Lightstep observability and other tools like Datadog, set up the on-call team, and define response rules for processing the alerts as they flow in.
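
The flow described above (alerts come in from several tools, then response rules decide which on-call team handles them) can be pictured with a small sketch. This is illustrative Python only and not Lightstep's API: the `Alert` and `ResponseRule` types, the field names, and the team names are all hypothetical, and first-match-wins routing is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alert:
    source: str    # tool that raised the alert, e.g. "lightstep" or "datadog" (hypothetical labels)
    service: str   # service the alert is about
    severity: str  # e.g. "critical" or "warning"


@dataclass
class ResponseRule:
    matches: Callable[[Alert], bool]  # predicate deciding whether this rule applies
    team: str                         # hypothetical on-call team to notify


def route(alert: Alert, rules: List[ResponseRule]) -> Optional[str]:
    """Return the team of the first rule that matches, or None if no rule matches."""
    for rule in rules:
        if rule.matches(alert):
            return rule.team
    return None


# Rules are checked in order, so the more specific rule comes first.
rules = [
    ResponseRule(lambda a: a.service == "checkout" and a.severity == "critical", "payments-oncall"),
    ResponseRule(lambda a: a.severity == "critical", "sre-oncall"),
]

print(route(Alert("datadog", "checkout", "critical"), rules))
print(route(Alert("lightstep", "search", "critical"), rules))
print(route(Alert("lightstep", "search", "warning"), rules))
```

In this sketch the alert's source does not affect routing, which mirrors the idea that alerts from different observability tools land in one place and are handled by the same rules.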

Lightstep Incident Response has also moved away from the traditional approach of paying per user, opting instead for usage-based pricing. Pricing by active services means that every developer can be involved in service ownership, whereas previous seat-based products presented challenges around deciding which developers should be a part of the incident management process.

Although the move to cloud native and microservices aims to accelerate release velocity by enabling every small team to make changes concurrently, it comes with side effects. If the side effect is in the service being deployed with a CI/CD pipeline, it may be possible to roll back the release and avoid an outage, but this is still costly. Because, according to Sigelman, most major outages come from an intentional change whose unintended side effect hits another team, one of Lightstep’s priorities from the start has been helping to diagnose issues that cross service and team boundaries.

Traditionally, developers may use four or five different technologies for monitoring and learning, plus separate solutions for collaborating and bringing together the right people. By bringing the capabilities of observability and incident response together under one platform, Lightstep aims to reduce Mean Time to Recover (MTTR), which in turn means more reliable, resilient services.
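
For readers unfamiliar with the metric, MTTR is commonly computed as the average time between detecting an incident and recovering from it. The sketch below shows that arithmetic in Python; the function name and the sample timestamps are made up for illustration and do not come from Lightstep or the interview.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_recover(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Two hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
incidents = [
    (datetime(2022, 1, 10, 9, 0), datetime(2022, 1, 10, 9, 30)),
    (datetime(2022, 1, 12, 14, 0), datetime(2022, 1, 12, 15, 30)),
]

print(mean_time_to_recover(incidents))  # prints 1:00:00
```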

“We believe that a platform like that would really help customers achieve that goal that Ben was talking about, which is reduced MTTR, and deliver these highly reliable and resilient services,” said RJ Jainendra, VP and GM for Emerging Business at ServiceNow.

About Ben Sigelman: Ben Sigelman is a Co-founder & CEO at Lightstep, a company that makes complex microservice applications more transparent and reliable. He is an expert in distributed tracing and also co-founded the OpenTelemetry project.

About RJ Jainendra: RJ Jainendra is VP and GM for Emerging Business at ServiceNow, focused on creating products with a Product-Led-Growth (PLG) go-to-market. He has worked in tech for over 20 years, with the majority of his career spent building tools for developers and IT.

About Lightstep: Lightstep’s mission is to provide clarity and confidence to the teams that build and operate the software that powers our daily lives. Founded by ex-Googlers, the cutting-edge observability platform gives engineers quick insight into how changes in their applications and infrastructure affect their end-users and their business.

About ServiceNow: ServiceNow makes the world work better for everyone. Their cloud-based platform and solutions deliver digital workflows that help organizations find smarter, faster, better ways to work.

The summary of the show is written by Emily Nicholls.


Here is the full unedited transcript of the show:

  • Swapnil Bhartiya: Hi, this is your host Swapnil Bhartiya, and welcome to TFIR newsroom. And today we have two guests, Ben Sigelman, CEO and co-founder of Lightstep, and RJ Jainendra, VP of Emerging Business at ServiceNow. RJ, Ben, it’s great to have you folks on the show.

RJ Jainendra: Swapnil, it’s really nice to be here.

Ben Sigelman: Yeah, thanks a lot.

  • Swapnil Bhartiya: Today, we are going to talk about Lightstep’s incident response announcement. But before we go there, Ben, I’m kind of curious, we have been speaking for so long. You folks created a lot of Open Source, OpenTelemetry, everything there. But you are traditionally known as an observability player. We have also discussed that the lines blur so much when you solve a problem. So where is this incident response product coming from? Tell us about the launch of the company ServiceNow.

Ben Sigelman: Yeah, it’s a great question. So Lightstep from an observability standpoint has always believed that there are really three layers to this solution. There’s the Telemetry itself, where, as you mentioned, we’ve spent a lot of time and effort getting OpenTelemetry to where it is, as I think the kind of de facto Open Source Telemetry solution.

There’s a storage layer, which I’m not going to talk about today because it’s off topic. But then the top layer, which is where the value is actually delivered, is workflow. From an observability pure-play standpoint, workflow has been quite fragmented for most of our… kind of what I would consider to be legacy observability players, where you have separate SKUs and tabs for infrastructure monitoring, and APM, and logging and things like that.

So we’ve been trying to address that standalone, but when we talk with our customers, the two things they’re really trying to accomplish are increasing their velocity and increasing their uptime. And increasing uptime is basically a matter of Incident Response and minimizing MTTR.

And the idea that the right solution for minimizing downtime would be to have two separate products, one for observability, and actually two separate vendors, even for observability and Incident Response. It just doesn’t make that much sense. Like the most obvious thing we could do is to integrate those into one platform and offer them under our brand.

Of course, we’re not going to require that customers do that, but I think customers actually want that. They want it to be a seamless experience where you don’t lose context. And that’s what this is all about. RJ, do you want to kind of talk a little bit about your take on Incident Response and the opportunity there?

RJ Jainendra: Yeah, thanks, Ben. That’s exactly right, in the sense that observability is the side that provides insights when things are not working or there are some unplanned changes. The response, the reaction side of it, is really what Incident Response brings in. So I sort of use the analogy of peanut butter and jam.

  • Swapnil Bhartiya: Right. I just want to go a bit, I want to zoom out a bit and talk about the larger picture. If you look at customers, we are trying to solve a certain problem for them, which is all about securing your reliability or whatever the term you want to use it. Cloud Native already is very, very complicated. There are so many Open Source projects. The CNCF landscape is busy. There are vendors, which is good. The ecosystem is booming, but that can be daunting for customers. As you rightly mentioned, there are different products coming from different vendors. If I just look at the whole evolution of, you can call it observability, you can call it monitoring, you can call it availability. The basic idea is, you have written an app, you have deployed it. Now you want to make sure that it keeps running. It could be security, whatever it is, the basic goal is that it keeps running there.

So from that perspective, Ben, you and I have talked about the whole OpenTelemetry earlier, a lot of consolidation and merger happened there. And I would love to have your opinion on this, RJ as well. What do you see? Where are we heading with this?

RJ Jainendra: Yeah. So, and that’s a great point because if you think about the work that needs to happen, when there is an incident, where seconds matter, minutes matter, I think it’s not helpful for developers and SREs that they have to context switch across a number of different tools to be able to figure out what’s going on and then how to involve the right people on the solutioning side. Right?

As we talk to customers, what they are learning is that invariably there may be four or five different technologies being used on the monitoring side and learning side. And then you have a separate solution for the on call and collaboration piece. And then yet again, something else for remediation.

So, what we see and believe is that if we’re able to bring these capabilities together, that a developer or SRE needs to do their job right from that observability all the way through Incident Response, collaboration and even getting to remediation and the postmortem, the learnings that you can then apply back to make a system more resilient. We believe that a platform like that would really help customers achieve that goal that Ben was talking about, which is reduced MTTR, and deliver these highly reliable and resilient services.

  • Swapnil Bhartiya: Ben, do you want to add anything to that?

Ben Sigelman: No, I think RJ said it very well, actually. The only thing I might add just on the margins is that, what we see with a lot of the customers we talk with, especially within really large enterprise environments where you have Cloud Native applications, which is the area where Lightstep specializes, they’re running on top of… like you’ll have some Cloud Native Kubernetes application, customer facing technology, that sort of thing. But then ultimately it’s running on top of a lot of other technology that’s probably been around for a long time and isn’t going anywhere, that’s managed by an IT operations team.

And another interesting thing about this, I mean Lightstep, as you know, was acquired into ServiceNow last year. Our product roadmap has just been accelerated, but part of the beauty of this product is that it’s something that works well in the Cloud Native environment, but also it bridges totally natively into the IT operations brain stem, which is really powerful.

I think we hear a lot of organizations struggling to unify a modern approach to SRE that genuinely works in a Cloud Native environment with tooling that is designed for SREs who want to use Cloud Native tooling, and then also kind of phones home to the central nervous system of a business.

And that’s a really important aspect of Incident Response as well, because there are a lot of business processes that depend on… Everything related to the risks associated with Incident Response needs to plug into the business in general. And I think what we’ve done here is pretty unique in that regard as well.

  • Swapnil Bhartiya: Excellent. Can you now talk about what it looks like? Is it a service, is it a platform, is it a product? What is this offering called? And how does it work?

RJ Jainendra: Yeah, so Lightstep Incident Response is the product and it’s available on lightstep.com. And so from that perspective, what we’ve done with Lightstep is we’ve elevated the brand where previously it was focused on observability. Now we have two product lines. One is around observability. The second is on Incident Response.

And on the Incident Response side, the new thing that we’re doing at ServiceNow is really making the solution something that’s self-serviced. So if you are an SRE, or a developer and you are struggling with some incident management, or if you’re a team lead, you can pretty much come to the website, sign up right away and get access to the product.

And the functionality is really being able to very quickly and seamlessly set it up where you can start ingesting alerts from observability products. So not only Lightstep observability, but if you’re using other tools like Datadog, et cetera, we can bring all those alerts in, set up your teams, set up your on call team, set up response rules on how to process these alerts as they flow in and really provide sort of that end to end experience.

So it’s all in that Lightstep Incident Response product that’s available. The unique thing that we’re doing, we spent a lot of time and effort really understanding the workflows for SREs and developers. So the user experience has gotten a lot of attention in terms of making it very seamless for them to come in and do what they need to do.

We integrate with Slack and Zoom, some of the collaboration tools that invariably come into play in managing these kind of incidents. And the other thing that we’ve done is, as we talk to customers, we see them struggling with this idea of how to take concepts of service ownership, which previously may have been those Ops guys out there who are responsible for uptime.

It really changed the culture of the team, where every developer is involved in service ownership. And so we see customers really struggling because some of the existing products in the market, they are priced by seats. So then you’re starting to have them pick and choose which developers should be a part of that incident management process versus should not.

And so we’ve taken a really different approach to say, “Hey, we’re going to price by the active services, the value that customers are trying to deliver. And they can bring the entire team into the product and really involve all the developers in this idea of service ownership to deliver that uptime that they’re looking for.”

  • Swapnil Bhartiya: I mean you covered these two points like cost and of course, ease of use making it easier for them. Also, I want to talk a bit about cultural aspect here also because incident, you know when something happened, that’s when in the company everybody comes together, you have not even seen those faces earlier. So how much do you think is culture also involved and how does this… You know, sometimes you have to overcome the cultural barriers. Sometimes you have to become a catalyst to bring in the cultural change within the companies. So can you talk about that also?

RJ Jainendra: Yep. So certainly we see this… in smaller companies, typically, it’s all hands on deck. When something happens, pretty much the entire team swarms the problem to figure out how to get back up. Though as we look at larger organizations, we are starting to see larger customers as they adopt Agile and DevOps practices. Kind of creating these two pizza box size teams that have the “you build it, you run it” mentality and they do sit on top of a platform, as Ben alluded to earlier.

And so we do see this adoption of change in the industry coming to happen, where you have these product teams that are responsible for the entire life cycle. Amazon has really made this popular, but now there are many more companies moving into that model. So from our perspective, it’s really how do we… from a product and technology perspective, foster that type of collaboration where you can easily bring in the entire team into the process?

Ben Sigelman: Yeah. I totally agree with all that. What we found with Lightstep’s observability product over the years is that the whole point of moving to Cloud Native and to moving to microservices was to accelerate release velocity. So every small team can be making many changes concurrently, and it’s been successful, but the side effect of all those changes is that there are a lot of side effects.

I think if the side effect is in the service being deployed, hopefully if you’re doing CI/CD you catch onto that and you roll back the release, and that’s costly and annoying, but hopefully not an outage. A lot of outages, if you look at most postmortems for major outages that will be posted to the internet, it’s almost always some intentional change. Like someone makes a change somewhere. It has an unintended side effect and it affects another team.

Mapping from team to team is a very difficult problem in observability, and something that Lightstep has really specialized in since day one, with our focus on really moving tracing forward in the industry in general. That’s the point, it is to help diagnose issues that cross service boundaries and cross team boundaries.

And that really gets back to what RJ was saying about how aggressive a seat model actually is for Incident Response. The incident management side of observability really needs to be an all hands on deck thing in terms of who’s in the tooling and who can help. Because oftentimes a developer will push a change into the code base, that maybe an SRE that’s running parallel to them will push out into production and that will break something somewhere else in the stack.

All those people should be involved in the incident process. And that’s what I like about the approach that we’ve taken. And I think we’re thinking about the system as a whole, both the software and the human and operational side of this with the release that we’re announcing here and with the direction that we’re taking.

  • Swapnil Bhartiya: Right. What I hear is also there are so many benefits of bringing observability and Incident Response together. You folks also touched upon some, but if I want you to kind of summarize, because I always care about, how does this change the life of a developer or Ops? Depending on who the target is. So if you can say, “Hey, this is the last thing they have to do, last thing they have to worry about, and they get everything in one place.” What would that be?

RJ Jainendra: That’s exactly right. So what we do is eliminate that context switch because we’re bringing the information from observability and, as Ben alluded to, the information about what are the different services that might be impacted by a single code change. So all of that information and context becomes available in one place for the developer to then quickly pinpoint the issue and then resolve it.

  • Swapnil Bhartiya: Excellent. Now, Ben… One second. It could be either RJ or Ben, or I would love to have insights from both of you is that, I mean, we have been covering Lightstep from very early time. But if I look at the company itself, especially with this announcement, what is the long term game, goal or vision for Lightstep? Do you see yourself… I mean, of course you cannot call yourself pure play observability platform anymore. Are you moving to SRE field? Of course, ServiceNow is also there. Which is actually a good thing, as you said, investment is getting into there and the product flow is growing, but when we look at Lightstep, what kind of company are we looking at today?

Ben Sigelman: Yeah, it’s a great question. I mean, I probably won’t get into talking about specific product announcements and things like that, but directionally, I think we see…I mentioned this actually in the beginning of this conversation, that when we talk about the workflow level of what our end users are struggling with right now, there’s a ton of fragmentation.

The Incident Response piece is probably the most glaring and obvious piece of that, but there are many other aspects where workflows that go in and out of the observability use cases that we started with are quite fragmented. And I think our goal is to try and… obviously talking with customers, to take on the workflows we feel we can do something. It’s not just unifying a bill, right? It’s actually unifying workflows and reducing context switching. That’s what we’re really trying to accomplish.

And since you mentioned the ServiceNow piece, I will also say that, there is adjacent messaging with ServiceNow, but it’s a totally non-competitive thing with ServiceNow, which is frankly, if it wasn’t the case, I wouldn’t have done the acquisition. And there’s a lot of multiplicative benefits with ServiceNow being so, so strong, across many different employee facing workflows.

What we can do with Lightstep is to see inside these customer facing applications and pull out insights about everything from customer behavior, to cost, to security, to basic IT operations, all that stuff really enhances the ServiceNow offering as well. So it’s not always that there’s a new SKU or a new product. A lot of it is significant differentiated enhancements in these other products that are actually affecting very different employees, well outside of the development sphere.

So that’s another area where I think there’s a lot of opportunity to deliver something that’s pretty unified and valuable. And the whole point in some ways is that we wouldn’t be necessarily releasing a new product. I think it’s more that we’re enhancing and delivering value that couldn’t be done otherwise through just a plain partnership.

  • Swapnil Bhartiya: RJ, I would ask you that, how does Lightstep fit into your broader vision? How does that come in, further complement whatever you have to offer? Because as we were talking about earlier, in the end it all matters how we are serving our customers and users.

RJ Jainendra: And that’s exactly right. From that perspective a lot of what Ben said, just to add to that. So we do see the benefit of extending observability, you know, the enterprise, into not just the IT operations, but the other use cases that are out there. And fundamentally ServiceNow is now… its core competency is workflows.

So from that perspective, we can envision a number of different workflow based solutions that can harness the insights that observability brings, and then serve our customers. I’m very interested in the SRE domain and how we can continue to make the life of the SRE better. These are folks who work under a lot of pressure, a lot of stress. There’s a lot of chaos and fatigue. And so anything we can do to service and improve their state is a win.

  • Swapnil Bhartiya: No, you are right, because a lot of mass resignations were happening. People are burning out, because also as the cultural changes are happening, there are so many things that are stopping in my back. You know, as a developer, you have to not only write that ticket, deploy it and secure it, and then security is a cat and mouse game, you never get any sleep at night. So yeah, people are getting… so the more we make their life easier, the better it is. I think we have everything about this announcement and we actually talked more on that. Is there anything else that you folks want to discuss or do you think that we have had a good discussion and we can wrap this up?

RJ Jainendra: That was a great discussion. We can wrap it up. Our products are available on lightstep.com. Go check it out.

  • Swapnil Bhartiya: Ben, RJ, thank you so much for taking time out today. And of course, we talked about this announcement, but we went deeper and tried to understand how observability, Incident Response, how we are trying to help developers and make their life easier. So thanks for those insights as well. And I would love to have you both on the show again. Thank you.

Ben Sigelman: Thank you so much.

RJ Jainendra: Thanks Swapnil. It was really nice to meet you.