OpenTelemetry Helps Navigate Cloud-Native Complexities With Observability Standards

Handling telemetry data can be difficult, particularly when it comes from different sources written in different languages. OpenTelemetry, an open source observability framework, aims to standardize how systems describe themselves, pulling together the three core signals (tracing, metrics, and logs) into a broad graph. This cuts down the time spent searching for and collecting data, so developers can get to the source of problems more quickly.

“Humans are not good at that kind of data processing. So, that is a very slow painful part of the process right now. OpenTelemetry provides all of that data connected together into a single correlated graph, which means it’s possible for people to now write analysis tools that do a lot of that data gathering and correlation analysis for you,” says Ted Young, Director of Developer Education at Lightstep, at this year’s KubeCon + CloudNativeCon Europe conference.

Key highlights of the video interview are:

Young discusses the basics of OpenTelemetry and how it helps with observability. He explains the challenges of handling telemetry data and some of the scenarios developers encounter. He discusses how OpenTelemetry helps to pull together the three core signals into a single data protocol.
The OpenTelemetry project aims to ease some of the sticking points with Kubernetes and cloud native. Young details how it does so by standardizing how systems describe themselves.
Once all the data from the different sources has been connected into a broad graph, people can feed it into an analysis tool like Lightstep. Young explains how this helps developers see what correlates with an error so that they can find the source of the problem.
Young discusses the benefits of having Kubernetes describe itself using the same standards and formats as the applications running on it.
Young describes the state of the OpenTelemetry project, what the project is focusing on, and the timeline for new features.
The CNCF landscape has a variety of observability tools, such as Prometheus or Jaeger. Young explains where OpenTelemetry fits in the context of these other projects and how it complements them.

Connect with Ted Young (LinkedIn, Twitter)

The summary of the show is written by Emily Nicholls.

[expander_maker]

Here is the automated and unedited transcript of the recording. Please note that the transcript has not been edited or reviewed.

Swapnil Bhartiya: Hi, this is your host, Swapnil Bhartiya, and welcome to KubeCon and CloudNativeCon here in Valencia. And today we have with us, once again, Ted Young. You are director of developer education at LightStep. First of all, it’s great to see you in person after a long time.

Ted Young: Yeah, it’s awesome. I’m so happy that we’re doing these in-person events again.

Swapnil Bhartiya: Yeah. Events are opening up, and if I’m not wrong, this is the first KubeCon of the year that is in person, with more coming. You have been, I’m pretty sure, to the keynote. And you have attended sessions. You have been to the booth section. What kind of energy are you feeling? What do you see?

Ted Young: It feels really good. It feels like people are surging back into action. We had an OpenTelemetry project meeting yesterday and it was a packed room, a lot of excitement. So, that’s been feeling really good.

Swapnil Bhartiya: Perfect. Let’s talk about OpenTelemetry because you brought up the topic. So, first of all, just explain it to the audience. We have covered it so many times, but it’s always good to refresh their memories: what is it about, and what role is it playing today? Because we’re also going to talk about some improvements you folks have made. So, let’s just start from the basics.

Ted Young: Sure, absolutely. So OpenTelemetry is a telemetry system. So if you think about observability, it’s really broken into two parts. There are systems describing what it is that they’re doing. And then there’s an analysis tool that’s looking at that data and trying to provide insights. So, OpenTelemetry is the first half. It’s the language systems use to describe what it is that they’re doing. It has three core signals. One is tracing. A second is metrics. And a third is logs. And those signals are braided together into a single data protocol called OTLP.

Swapnil Bhartiya: We talked about this earlier, and I love to talk about it, because, as you just said, there are two things: one is what’s happening, what’s going on, and then there is analysis. There is one more step, which is actually to do something about what is happening.

Ted Young: Yes.

Swapnil Bhartiya: So, when you look at the whole observability space, and since we love new labels so much, we keep creating them: logging, monitoring, observability, and something new will be coming soon. But in the end, what we are trying to do is solve a problem.

Ted Young: Yes.

Swapnil Bhartiya: So talk about that as well.

Ted Young: Yeah. So I think you make a good point that people are not… You don’t have a metrics problem or a log problem or a tracing problem. Your system has some kind of systematic failure. Either there’s a literal bug in the code. The code is just doing something wrong. But, more often than not, assuming people are doing testing and catching most of those bugs, a lot of the problems in production come from the way all these different pieces of code and all these different services are interacting live.

So, you often end up with what’s called resource contention. So you have a bunch of independent requests that are all hitting the same services at the same time. They’re all trying to use the same resources. And they might interact with each other in a way that’s surprising and hard to discern when you’re just testing and development, at small scale.

Those large-scale interactions, which happen live in production, can be very difficult to track down. And in order to track those down, you end up needing to look at a number of different data sources and try to build up a mental model of what’s happening.

And if the data that you’re using is siloed, so your metrics are over here, completely unconnected from your logs, which are completely unconnected from your traces, as an operator, when you form a hypothesis and you want to validate your hypothesis about what’s wrong, you then have to do this data gathering across all these different signals.

And since they’re not connected together, you have to become the glue that’s connecting all of this data together. And you have to find all those correlations.

And humans are not good at that kind of data processing. So, that is a very slow painful part of the process right now.

OpenTelemetry provides all of that data connected together into a single correlated graph, which means it’s possible for people to now write analysis tools that do a lot of that data gathering and correlation analysis for you.

So, you can go quickly from having a hypothesis to validating that hypothesis, with the data searching and data collection you would need to do in the middle automated for you.
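
To make the idea of a single correlated graph more concrete, here is a minimal Python sketch that uses only the standard library. It does not use the OpenTelemetry SDK, and the record shapes (`Span`, `LogRecord`) and their field names are simplified assumptions for illustration. It only shows that when logs carry the same trace ID as spans, a tool can join them mechanically instead of leaving the operator to be the glue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (not the OTLP schema).
    trace_id: str
    name: str
    service: str
    error: bool = False


@dataclass
class LogRecord:
    # Hypothetical log record that carries the trace ID it was emitted under.
    trace_id: str
    message: str


def correlate(spans, logs):
    """Group spans and logs by the trace ID they share."""
    graph = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        graph[span.trace_id]["spans"].append(span)
    for log in logs:
        graph[log.trace_id]["logs"].append(log)
    return dict(graph)


def logs_for_failed_traces(spans, logs):
    """Return log messages from every trace that contains an error span."""
    messages = []
    for entry in correlate(spans, logs).values():
        if any(span.error for span in entry["spans"]):
            messages.extend(log.message for log in entry["logs"])
    return messages


spans = [
    Span("t1", "POST /checkout", "frontend"),
    Span("t1", "charge-card", "payments", error=True),
    Span("t2", "GET /cart", "frontend"),
]
logs = [
    LogRecord("t1", "card declined by gateway"),
    LogRecord("t2", "cart loaded"),
]

failed = logs_for_failed_traces(spans, logs)
print(failed)
```

With siloed data there is no shared key to join on, which is the manual work Young describes; here the shared trace ID does the joining.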

Swapnil Bhartiya: Excellent. Thanks for explaining that. You mentioned the three signals, and that is something we are also going to talk about. But before we go there, I also want to understand how LightStep, or OpenTelemetry, helps ease some of these pains, because Kubernetes, or cloud native itself, is very complicated.

Ted Young: Yes.

Swapnil Bhartiya: It could be painful. So talk about that. Once again, back to the point that you were making: in the end, it is about solving a problem. All these things are secondary.

Ted Young: Yeah. So, what OpenTelemetry adds is, because it’s open source, and it standardizes how systems describe themselves, so an HTTP request always looks like an HTTP request. A container always looks like a container. A Kubernetes pod always looks like a Kubernetes pod. It’s possible to add OpenTelemetry to all these different layers of this stack. So Kubernetes can now start emitting OpenTelemetry data. That’s something that’s in the works, getting OpenTelemetry integrated into Kubernetes.

Data services, like Kafka or MongoDB, or even hosted services like databases that cloud providers might give you, like AWS services or Google services, all of those services can now start emitting OpenTelemetry data, and that data can then be connected to the data that your applications are emitting.

So, you now have this broad graph coming in from all these different sources, but it’s all the same data. It’s all standardized.

So, you can now feed all of that into an analysis tool like LightStep, which is designed to look across all those different data sources and provide correlation.

So, for example, you might have an alert going off for a particular endpoint, let’s say it’s your checkout endpoint for your web store. Very critical if that stops working. Maybe you’re noticing there’s a spike in errors, or some problem with that checkout. You want to know what correlates with those errors. What’s the source of that problem?

One correlation might be a particular library, in a backend system, is having problems. It might also be that say a particular node, like let’s say a particular Kafka node or a particular server, it’s only the requests that are hitting that particular server that are having problems.

Or maybe it’s something a little more nuanced, like this backend service emits an error when the client that initiated the request is on a certain version of iOS.

Those kinds of correlations will tell you a lot about where to look for the source of the problem. But noticing that those things correlate, for example that an error at this endpoint is associated with a particular virtual machine that a particular backend service is running on, or that a particular version of a mobile client is the source of all the problems, can really take a human a lot of time and a lot of digging around. Tools like LightStep automate that correlation detection.

So, you go straight to seeing that there’s a correlation between your error and these other things. That won’t tell you what the cause is, but knowing that correlation exists will propel you to a hypothesis much faster than if you have to just search around a whole bunch to even notice that the correlation is there.
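
The kind of correlation detection described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how LightStep or any other product works: it compares the error rate of each attribute value against the overall error rate and ranks the values that stand out. The function name, the `min_count` threshold, and the sample data are assumptions; the attribute keys are written in the style of OpenTelemetry semantic conventions.

```python
from collections import Counter


def rank_error_correlations(spans, min_count=2):
    """Rank (attribute, value) pairs whose error rate exceeds the overall rate."""
    total = len(spans)
    if total == 0:
        return []
    overall_rate = sum(1 for span in spans if span["error"]) / total

    seen = Counter()    # how often each (key, value) pair appears
    failed = Counter()  # how often it appears on a span that errored
    for span in spans:
        for pair in span["attributes"].items():
            seen[pair] += 1
            if span["error"]:
                failed[pair] += 1

    ranked = []
    for pair, count in seen.items():
        if count < min_count:
            continue  # too few samples to say anything
        rate = failed[pair] / count
        lift = rate - overall_rate
        if lift > 0:
            ranked.append((pair, rate, lift))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


def checkout_span(host, os_version, error):
    # Hypothetical sample span for a checkout request.
    return {
        "attributes": {
            "http.route": "/checkout",
            "host.name": host,
            "os.version": os_version,
        },
        "error": error,
    }


spans = [
    checkout_span("node-a", "15.4", False),
    checkout_span("node-b", "15.4", True),
    checkout_span("node-b", "14.8", True),
    checkout_span("node-a", "14.8", False),
    checkout_span("node-b", "15.4", True),
    checkout_span("node-a", "15.4", False),
]

ranked = rank_error_correlations(spans)
for (key, value), rate, lift in ranked:
    print(f"{key}={value} error_rate={rate:.2f} lift={lift:+.2f}")
```

In this sample data only `host.name=node-b` stands out, which matches Young's point: the correlation does not name the cause, but it tells you where to look first. Such a comparison is only possible because every source uses the same attribute names.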

Swapnil Bhartiya: Excellent. Once again, thanks for explaining that in detail. When you explain it, one can almost see it visually, the way you explain things. I love it. One more thing on that point of making things easier or better: can you also talk about natively integrating with Kubernetes, since you touched upon that as well? First of all, what does that mean for users? And second, what does it mean for Kubernetes and the maintainer community?

Ted Young: Yeah, it just means that if you’re running your applications on top of Kubernetes, normally you would just get observability data out of your application. And Kubernetes itself has had some amount of observability baked into it. There’s an event system. I believe it does produce some metrics. But it doesn’t have the kind of observability that you would get out of fully instrumenting all of Kubernetes with OpenTelemetry. So, the advantage of integrating OpenTelemetry into Kubernetes is that it’s the infrastructure your application is running on. And that infrastructure is going to provide new data sources in which to find new correlations.

So, noticing, for example, that an application is restarting and being rescheduled a whole bunch by Kubernetes, and trying to understand why that’s happening. Like if an application’s having difficulty starting, you might be able to get some insights out of that. Noticing that the problem is occurring when several different applications are scheduled to run on the same VM. Just noticing that Kubernetes itself is under provisioned. You’re trying to run too many apps on too few machines.

That’s the kind of information you get out of your container platform. So having Kubernetes be able to emit that data using the same standards, the same formats, the same way of explaining itself as your applications, makes it a lot easier to then integrate that information later.

Swapnil Bhartiya: And, as we were discussing earlier, there are changes going on with all these signals, such as metrics. So, once again, give us the state of the project.

Ted Young: Sure. Yeah. So, in the current state of OpenTelemetry, tracing has been stable for quite some time, over a year. Metrics just went stable this week. So we have release candidates available for metrics, in a number of languages. And those will all be totally GA, probably in the next couple of weeks. And logging, the third major signal in OpenTelemetry, is mostly stable. Most of the logging pipeline is already built.

The last piece is just designing and implementing our own logging API. Today, if you want to use logging, you just need to use an existing logging API, but for some reasons we wanted to also write our own, and that’ll be coming soon. So, probably by end of Q3, logging will be stable as well.

We’re also expanding from just backend services to also client telemetry. So, we’ve been working with RUM experts, real user monitoring experts, to specify how OpenTelemetry should describe desktop and mobile and browser clients. So, that’s also coming this summer. So, it’s a lot of good stuff.
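
As a rough illustration of using an existing logging API while still tying logs to traces, the sketch below uses Python's standard `logging` module and a filter that stamps the active trace ID onto each record. It does not use the OpenTelemetry SDK; `current_trace_id`, `TraceContextFilter`, and `build_logger` are hypothetical names, and a real tracing library would manage the active context itself.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Configure a standard logger whose output includes the trace ID."""
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = build_logger(stream)

# Simulate a request that runs under a trace, then a log line outside any trace.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("payment authorized")
current_trace_id.reset(token)
logger.info("idle")

output = stream.getvalue()
print(output, end="")
```

The application code keeps calling the logging API it already uses; only the handler configuration changes, which is the general shape of bridging an existing logger into a correlated pipeline.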

Swapnil Bhartiya: Excellent. One more question, because OpenTelemetry itself is a merger of two different open source projects, and that happened because, once again, there was a kind of overlap. The community was confused about which project to go with. But when you look at the broader CNCF landscape, there are so many projects if you just go and look.

So, a lot of projects overlap in some way. It’s a very broad question, and we may not have the answer right now, but especially in the observability, tracing, and telemetry space, do you see that these projects complement each other and that some consolidation may happen? Or do you feel that these projects are trying to solve the same problem, and that it’s better for users in the end?

Ted Young: Yeah. So I would say within the CNCF landscape, most of the other observability tools like Prometheus or Jaeger, the primary value that they give is they’re an analysis tool. So, they are a data storage tool for observability data and a UI and a set of analysis tools that people use to monitor their systems and investigate change and things like that.

OpenTelemetry is a data source for these tools. So, we worked extensively with the Prometheus design team when we designed the metrics pipeline in OpenTelemetry, to make sure that it would work well as a source of metrics data for Prometheus.

Likewise, we try to make sure that our tracing tools work well as a source of data for Jaeger. In fact, it’s like the primary source of data for Jaeger. And, long run, if these other tools choose to, Jaeger’s already doing this, a lot of vendors have already done this, but they can start retiring their own clients and instrumentation if they like, to just use OpenTelemetry as their primary ingestion pipeline, basically.

So, I don’t want to speak for other projects, but that’s the way I see them complementing each other. We have no interest in ever expanding OpenTelemetry to also become a data storage tool or an analysis tool, because we don’t think that’s something standardizable. That’s the place where it’s always green field. There’s always new kinds of analysis people do. There’s always a better way to do that.

But describing systems, having systems describe what they’re doing, that’s a place where we can all agree on some standards. So that’s, I think, why OpenTelemetry complements all the other observability products out there.

Swapnil Bhartiya: Excellent. And you have handled that very nicely, maintaining a balance while sharing some insight into where things might be heading, without speaking on behalf of other projects. So you play it safe, and you also express what the goal is. In the end, as we were discussing, it is to help the customer. These technologies are secondary.

So, once again, thank you so much for taking time out today. And it’s really beautiful to just sit in person and talk about this thing. The energy is different as well. So thanks for joining me today. And I hope that we will be doing more of these shows in-person, in future as well. And I hope you’ll enjoy the show as well.

Ted Young: Absolutely. Yeah. And hopefully I’ll see you at the next KubeCon.

Swapnil Bhartiya: Excellent. Thank you.

[/expander_maker]
