Observability Means Being Able To Get Insights Into A Black Box System: Andreas Grabner, Dynatrace

Guest: Andreas Grabner (LinkedIn, Twitter)
Company: Dynatrace (Twitter) | Project: Keptn (Twitter, LinkedIn)
Show: Let’s Talk

Dynatrace is all about helping customers build and operate software that runs perfectly for their customers to support business needs. The company predates many of the cloud-native technologies that we rely on today. But as the IT landscape changed, so did Dynatrace. It continues to help customers in their journey by providing software intelligence to simplify cloud complexity and accelerate digital transformation.

We have been covering Dynatrace on TFiR for a while now and I finally got the opportunity to host Andreas Grabner, DevOps Activist at Dynatrace and Developer Advocate for CNCF Keptn.

“We are in the observability space,” says Grabner, “That means Dynatrace itself is a software intelligence platform that is pulling in data through different ways of observability, and then really giving this data to the different stakeholders whether it’s developers early on to make sure that they understand the impact of their code changes, or also the business side to show them how is the software actually used.”

Grabner has been with Dynatrace for more than 14 years and has seen the evolution of the IT stack and cloud infrastructure. He says that meeting the modern needs “means additional challenges for observability vendors on how we can actually get the data.” Grabner continues, “A lot of things have changed. But the ultimate premise is still the same: How can we help the software engineers, the performance engineers, the site reliability engineers, the DevOps engineers, and the business with better data so that they can make better decisions on what the next step should be with the software services.”

Grabner also comments on how observability has changed over the years when he says, “If I look back 14 years ago, it was easier for me to analyze a monolithic Java application because there was a very small scope of data that I had available.” As to observability in the current landscape, Grabner points out that companies are deploying to different clouds, running in the JVM or serverless frameworks, or even in systems within clouds they don’t have access to. In the end, Grabner says, “We need to make sense of the data and give not just more data, but actually actionable answers to people.”

Perform 2022 is on the horizon and of this event, Grabner says, “It is the event where Dynatrace and the larger observability community come together.” The theme for this year’s event is “Game Changers” and Grabner finishes up by saying, “We bring people on stage, whether in the keynotes or in the breakouts, that talk about how they have changed their lives over the last couple of months and years to become a game changer in their organization so that they can deliver better digital services to their end users.” The format of Perform 2022 will be fully virtual and attendees can sign up on the Dynatrace site.

There was so much to learn from him in this discussion. Here are some of the topics we covered in this show:

What does Dynatrace do?
How has Dynatrace evolved over time? How has observability itself evolved? How has software intelligence evolved as we are moving from the old stack to a cloud-centric world?
I was also curious to know how Grabner defines observability.
We then talked about the significance of observability for businesses and why they should care about it and why it should be a core part of their strategy.
We talked a lot about the Shift Left and DevOps movement. But how much is happening in reality?
I then asked him if he could share a playbook for DevOps to get started with when it comes to observability.
As expected we also talked about Keptn, a project that was created by Dynatrace and donated to CNCF.
Last but not least, we talked about the upcoming Perform 2022 event that will be hosted virtually.

It was an incredible experience talking to Grabner. I hope you will enjoy the discussion as well.

The summary of the show was written by Jack Wallen.

[expander_maker]

Swapnil Bhartiya: Hi. This is your host, Swapnil Bhartiya, and welcome to another episode of Let’s Talk. And today we have with us Andreas Grabner, DevOps Activist at Dynatrace. Andi, it’s great to have you on the show.

Andi Grabner: Yeah, thank you for inviting me to this show. It’s an honor to be here. I saw some of your previous episodes. So it seems… Well, I’m lucky that you considered me as a guest. Thank you.

Swapnil Bhartiya: No, we are honored to have you on the show. And since this is the first time we’re talking to you and to Dynatrace, so I would love to know… I mean, of course, we do know what you folks do. But just for our audience, tell us about the company. What do you folks do?

Andi Grabner: Yeah. So, our mission is to help our customers to build and operate software that runs perfectly for their customers to support their business needs. I personally have been with the company for almost 14 years. So from day one, I was really excited about the mission statement of Dynatrace and I’m still here after 14 years and hopefully many more years to come. In my role, I help our customers, also the wider IT community, with building, testing, operating software that supports the business as I said. And we are in the observability space. That means Dynatrace itself is a software intelligence platform that is pulling in data through different ways of observability, and then really giving this data to the different stakeholders whether it’s developers early on to make sure that they understand the impact of their code changes, or also the business side to show them how is the software actually used, where do things maybe need some tweaking to make sure that their end users have a better experience, yeah.

Swapnil Bhartiya: I mean, of course as you said, you have been with the company for almost 14 or so years. The company has been around for a long time, which also means that you folks have seen how the whole observability space is changing. So, can you also talk about how Dynatrace has evolved? How has observability itself evolved? How has software intelligence evolved too as we are moving from the old stack to a cloud-centric world?

Andi Grabner: Yeah. So remember, when I joined back in 2008, right? The technology we had was already centered around distributed tracing, the hype these days. But we’d been doing this when I joined in 2008. We were mainly doing this on the classical, let’s say, three-tier applications that we had back then. A monolithic app with a database backend and some front end. And we were doing distributed tracing and gathering additional information that helped especially performance architects and the developers to make sure that their software runs smoothly and, in case there’s any problem, we can point it out. But as you said, this has changed over the years. Then we saw obviously the rise of the microservices. We saw the rise of the container architectures. We see people moving their applications from on-premises to the cloud, or also back again, right? There are a lot of different hybrid models, a lot more moving pieces.

Also, back then, 14 years ago when I started, there were a handful of technologies we supported. Java, .NET, some web servers and application servers. Nowadays, well, you have a broad range of technologies that you control or don’t have control over because it depends on where it runs. And this also means additional challenges for observability vendors on how we can actually get the data. And so, a lot of things have changed. But the ultimate premise is still the same, right? It is how can we help the software engineers, the performance engineers, the site reliability engineers, the DevOps engineers, and the business with better data so that they can make better decisions on what the next step should be with the software services. How they innovate, right?

Swapnil Bhartiya: I also quickly want to talk about observability itself because of course once again, it’s evolving. We do talk a lot about it. But knowing something is going on in your system is as important as it is to do something about it. Actionability also. So, how would you define observability? Where does the role of observing the system if it stops and the role of doing something about it, it starts? Or it’s just a tag or label that we use to look at it holistically?

Andi Grabner: Yeah. I think if you ask me, you may get a different answer than if you asked five other so-called experts. Well, I don’t call myself an expert even though I’ve been in this space for a while. I think for me, what observability means is being able to get insights into a black box system, whether this is in the form of the key pillars of observability, the logs, the traces, and the metrics or events. But in the end, observability is how can I get health signals out of an otherwise kind of black box system? And there are different ways of doing it. The runtimes where our apps run, whether it’s a Java runtime, a .NET runtime, a Go runtime, whatever it is, or a serverless framework, these runtimes already provide some out-of-the-box observability, fortunately, because these runtimes have also evolved.

Then there are techniques that we have kind of developed over the years to automatically inject more instrumentation points to get more data out of it. We have seen fortunately the rise of frameworks like Prometheus, which became definitely one of the standards in order to get your metrics out. And we have OpenTelemetry, which is a great open source project that encourages developers, but also especially framework providers or software vendors, to pre-instrument their code because they should know their code best and what’s important. So, observability is all about how can we collect the data. But then, the question is what do you do with this data? Because again, if I look back 14 years ago, it was easier for me to analyze a monolithic Java application because there was a very small scope of data that I had available.

Now, I have everything from stuff that runs in different clouds, running still in the JVM or in a serverless framework, to some running in systems that I don’t have access to at all because it’s a cloud service. I have so many more moving pieces. So the thing though is I don’t have more time, whether I’m a developer, a DevOps engineer, an SRE, or on the business side. And this is where I think the role of vendors like Dynatrace comes in. We need to make sense of the data and give not just more data, but actually actionable answers to people in saying, “Hey, with this particular code change you’ve just introduced a new security vulnerability,” or, “This is going to cause you a performance regression,” or, “This particular change in your user interface will not make the users happy because they’re now no longer interacting like they used to.” So I think the magic is not only collecting the data, but the thing we need to do as an industry is make sense of the data and give individual stakeholders the data they need to make a better call on what they do next.

Swapnil Bhartiya: Excellent. Now, I am also curious that all the work that you folks are doing, you talked about security and you also talked about, of course, performance. And I think today businesses, these are two things we have started talking a lot about. Security is becoming prime after looking at all the recent vulnerabilities and attacks. And second is performance or reliability of the site. When we look at the whole cloud or modern stack, it is already very complicated. If you look at Kubernetes, things get complicated even more quickly. So now, we are talking one more thing in the whole stack. So, can you talk about the importance of having an observability strategy at company? How kind of it affects their bottom line? Because security is important, reliability is important. So talk about why businesses should care about reliability or I mean, in a way, observability? How much they’re only doing, or you feel that, “Hey, you know what? We still need a lot of awareness so that companies should have observability as core part of their strategy.”

Andi Grabner: Yeah. I mean, let’s take the recent example, right? Log4Shell is still on our minds because it ruined a lot of Christmases for many engineers out there unfortunately. But thanks to everybody that was really working hard. But I think it showed a lot of organizations that they don’t even know whether they run this particular vulnerable library, or what impact it has depending on whether it runs on systems that are actually exposed to the internet or not. Because just knowing that something is potentially loaded maybe overflows you with a lot of data. You need to be focused on what is actually really important. And in the case of Log4Shell, it is those systems that are exposed to the public internet. And I think this was a realization for many companies that they were missing this type of aspect of observability.

For me and for us at Dynatrace, this is part of our platform. We do not only do classical metrics, logs and traces. We also automatically scan for vulnerabilities, but not only while the code gets loaded, but also as new vulnerabilities are detected and reported. And so, I think unfortunately, sometimes it takes incidents like Log4Shell. It takes an event where a whole data center, a whole region of a cloud vendor goes down and people realize, “We didn’t even know about it until people called us up. Why didn’t we know about this? Why don’t we have observability? But more importantly, why don’t we have systems that proactively alert us about something bad that is happening or is about to happen.” I think these events as unfortunate as they are, they make sure that the rest of the IT industry that has not yet invested into smart observability is now waking up and is definitely investing in it.

Swapnil Bhartiya: Right. And once again, I want to ask the question. I just want to go a bit deeper. We talk about shift-left a lot. A lot of things are moving into the developer’s pipeline. So how much are you seeing this already in practice, versus we preach a lot about it but nobody’s doing it?

Andi Grabner: No. In my role, I’ve been preaching shift-left, so basically using observability as early as possible, for the last, I think, eight or nine years, since I’ve been in my current role at Dynatrace. I know that I’m not the only one out there that is preaching it, but I think we still have obviously stuff to do because not the whole world is doing it. And we see this not only internally at our company, because we obviously are a software company like all the others that we support, but I see it with many of our accounts that they’re shifting left, right?

We have people at our upcoming conference at Perform 2022 that go on stage and really say how they have shifted-left with amateurs, how they integrate observability, how they integrate SLOs, service level objectives, that range from performance metrics and reliability metrics to security and business metrics, and inject them into their Jenkins pipelines, into their GitLab pipelines, to make sure that they’re not allowed to push any bad code changes that impact performance, reliability, or security into production. So I think we still have a long way to go until everybody’s doing it, but we have some great examples at our upcoming Perform conference that are going to be on stage. Some actually also talk with me in the breakouts, and they talk about their best practices and how to shift left and how to detect things earlier.

Swapnil Bhartiya: Yeah, I’ll talk about Perform 2022 in a bit. I just quickly also want to talk about how these engineers, the DevOps teams, embrace this. Do you have any playbook, any set of rules, or best practices that they can follow to at least get started?

Andi Grabner: Yeah. So what I’ve been kind of preaching over the last year and a half is establishing SLOs, service level objectives, right? It comes out of the site reliability engineering movement, where Google, I think, has done a great job in explaining to the world what SRE is at Google. SLO, service level objective, is a key concept. And so, one of the best practices we are preaching is define your set of SLOs that are important to the business. Availability, performance, user experience. Define them, monitor and alert on them in production. But then, take those SLOs, break them down, and shift them left into your testing environments, so that every test that is executed will make sure that the software, once released in production, will not violate the critical SLOs.

But then, shifting it even further left into development, which means, right? If you have, let’s say, an availability SLO in production, what does that mean for a developer? Well, he or she can make sure that their software is not, through a code change, increasing memory usage or latency, because this will later on impact performance and availability. So that’s what we are teaching, and there are different blogs and tutorials out there where we preach this. We also have services that we are doing with our customers. We are taking the SLO on a journey from business and then shifting it left all the way into development, making sure in the end everybody’s connected with the common business goals, but everybody’s contributing in a different way. A developer is contributing in a different way than an SRE, a performance engineer, or the business.
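The shift-left idea Grabner describes, taking production SLOs and checking them against every test run, can be sketched as a small quality gate. This is only an illustrative sketch in Python, not how Dynatrace or Keptn implement it: the `Slo` class, the `evaluate_gate` function, the metric names, and the thresholds are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical service level objective: a named metric and its limit."""
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        # Availability-style metrics must stay at or above the threshold;
        # latency- or memory-style metrics must stay at or below it.
        if self.higher_is_better:
            return measured >= self.threshold
        return measured <= self.threshold


def evaluate_gate(slos, measurements):
    """Return (passed, violations) for one test run.

    A metric missing from the measurements counts as a violation,
    so a run cannot pass by simply not reporting data.
    """
    violations = []
    for slo in slos:
        value = measurements.get(slo.name)
        if value is None or not slo.is_met(value):
            violations.append(slo.name)
    return (not violations, violations)


# Assumed example objectives, not values taken from the interview.
SLOS = [
    Slo("availability_pct", 99.9, higher_is_better=True),
    Slo("p95_latency_ms", 300.0, higher_is_better=False),
    Slo("peak_memory_mb", 512.0, higher_is_better=False),
]

# Assumed measurements from a single test run after a code change.
test_run = {
    "availability_pct": 99.95,
    "p95_latency_ms": 340.0,
    "peak_memory_mb": 480.0,
}

passed, violations = evaluate_gate(SLOS, test_run)
print(passed, violations)  # False ['p95_latency_ms']
```

In a pipeline, a step like this would typically exit with a non-zero status when `passed` is false, so that a change which regresses latency or memory is stopped before it reaches production.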

Swapnil Bhartiya: Excellent, thanks for explaining that. Now, I quickly want to touch upon Keptn as well. What’s going on with Keptn? Give us an update on it.

Andi Grabner: So yeah, thanks for bringing this up. So Keptn, for those people that don’t know it, it’s spelled K-E-P-T-N and it’s the German phonetic of a captain of a ship. Because people sometimes wonder, “What is this all about?” So what we have realized at Dynatrace is the concept that I just mentioned earlier, SLOs are very important. As humans, we make decisions based on data. And in our world right now, a lot of things are centered around SLOs. Therefore, what we have done with Keptn, we have come up with a new project that is taking SLOs at the center and it’s then orchestrating your DevOps tool chain to push artifacts through your pipeline all the way to production, but also orchestrate all the remediation in case things go wrong in production. And then, orchestrate the right tools to bring a system back to its desired state, but always looking at your SLOs, your service level objectives, to then make decision in which direction the orchestration should go.

A year and a half ago, Dynatrace donated the Keptn project to the CNCF, the Cloud Native Computing Foundation. We are a sandbox project, but just got our sponsor at KubeCon to become an incubating project. So, that’s the next goal for next year. I think it’s exciting times because one of the other things Keptn does pretty well is it solves a massive problem that we see in many organizations. Everybody knows we need to do more automation. But actually, the problem is not automation, because individual tools do a pretty good job in automating certain tasks. The challenge becomes orchestrating these tools to automate your different sequences, right? Delivery sequences or remediation sequences.

And what we have seen internally and with our users is that a lot of custom coding goes into connecting all these tools because every tool has a different API and so on. With Keptn, we have an open source project where we are currently standardizing the way all the different DevOps tools talk with each other. And Keptn provides the reference implementation for this orchestration, based on these events and with SLOs at the center. So, orchestration is always moving forward depending on how your SLOs are doing, and the SLOs come from the observability platform.

Swapnil Bhartiya: Now let’s talk about this event, Perform 2022. Please tell us what is going to be the focus? What is going to be the format? Is it virtual? Is it in person? Talk about it.

Andi Grabner: So, the conference is Perform. I think it’s now the 10th or 11th time, at least as far as I remember back, and we’ve been doing this pretty consistently. It is the event where the Dynatrace community, but really the larger observability community, comes together. This year, the theme is game changers. So we really bring people on stage, whether in the keynotes or in the breakouts, that talk about how they have changed their lives over the last couple of months and years to become a game changer in their organization, so that they can deliver better digital services to their end users. So the format was just announced: we are, unfortunately because of the current situation that is going on in the world, moving to a fully virtual event.

Nevertheless, still the same great speakers, the same great atmosphere. It’s about users, practitioners talking about what they do in their day-to-day life with observability, with the Dynatrace platform, but also with our open source initiatives that they are interacting with, like Keptn and others. I’m personally very much looking forward to one particular speaker that I happen to be interviewing on stage, Kelsey Hightower. So Kelsey Hightower, and a lot of people refer to him as Mr. Kubernetes, he’s also going to talk about the evolution of Kubernetes, the chances but also the complexity, how we have to tame the complexity, and how Dynatrace helps you as well.

Swapnil Bhartiya: Excellent. And since it’s going to be virtual, what is the best way for folks to better consume Perform 2022? Do you have any tips for that as well?

Andi Grabner: Yeah, the easiest is to go to the Dynatrace website. Either go dynatrace.com, I’m sure you’ll find the link there to Perform, or I think it is perform.dynatrace.com. Register, get your seat. It’s all streamed live and in the different time zones. So, it’s a truly global event. We have different time zones that we cover. There will also be an option to then consume that content whenever obviously it makes sense for you. So, that means we adapt to your needs. But certain things obviously are always great to be consumed live like the keynotes, the great announcements that we’re going to make.

Swapnil Bhartiya: Andi, thank you so much for taking time out today and of course for talking about Dynatrace, observability and, more importantly, the Perform event that’s coming up. And we’d love to of course have you back on the show again. So, thank you for your time today.

Andi Grabner: Yeah, thanks for having me. And as you said, hopefully, this was just the first time and not the last time. See you soon.

[/expander_maker]
