Why Is Large-scale Kubernetes Monitoring So Hard?

As your Kubernetes environments grow and encompass more services, how do you keep them running smoothly?

Change is the only constant, right? We are living in a world where technology has improved dramatically from every perspective, at breakneck speed. Thanks to disruptive innovations like Kubernetes and elastic cloud environments, applications behave much better from a performance perspective. We no longer get outright outages as frequently as we did ten years ago. But as innovation expands, so does complexity, which makes it an ongoing game of leapfrog to keep on top of the application reliability challenges.

If you’re running your business-critical services and applications in Kubernetes today, you’re probably managing the stability of your environment with a diverse mixture of APM tools, developer tools, Elastic, AWS CloudWatch, kubectl commands, and more. But this eclectic approach presents challenges. You’re getting logs, metrics, and events from a variety of sources. These streams are not correlated, not delivered within the context of related services and dependencies, and perhaps hardest of all… not mapped across the axis of time.

Even before containers became pervasive, as infrastructure evolved from physical to virtual to software-defined, monitoring was a consistent challenge. In all these environments, the landscape changed frequently and was hard to track over time.

Yet despite the ever-growing complexities, service level commitments haven’t changed. You still need to keep your environment humming along smoothly 99.99% of the time. When something breaks, how do you bring it all together to find the cause and which team do you call to fix it? How do you aggregate all data streams into usable information that helps you respond quickly to outages?

Many companies call panicked war room meetings to gather all teams, each bringing data from its own silo. War rooms are painful and inconvenient at best, and they eat up valuable time until a clear cause has been found and the right team is on the hook to fix it.

Some SREs bite the bullet and dig through metrics and log files across different systems. Then they call who they think is responsible… though that team might “pass the buck” and say it’s not their issue. More time lost.

Others try to aggregate everything into Splunk or Elastic. Some feel that adding red tape is necessary to ensure all builds behave consistently across several iterations of sub-production environments. Still others use the dreaded four-letter term, ITIL, along with change approval boards and rigorous approval processes. And yet others have decided to invest in hugely expensive AIOps projects, which often yield nothing more than black boxes spewing out incidents like Tic Tacs.

Are these approaches working now? Will they scale?

The need for a more comprehensive and correlated view of data is clear, but what might that look like? The first challenge is to get your environment represented in a visual topology.

You need to show the relationships between namespaces, nodes, pods, and so on, and how a change in a container may impact the availability of a sibling Service. Basically, you need a map of everything (not just logs but topology, telemetry, and events), all combined into one view that highlights dependencies. Did something break? What other related components might have caused the problem or be affected by it?
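As a rough illustration of those last two questions, here is a minimal Python sketch that treats the topology as a dependency graph and walks it in both directions. The component names and the DEPENDS_ON map are hypothetical, made up for this example; a real environment would build the graph from the cluster itself rather than hard-code it.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "service/checkout": ["pod/checkout-1", "pod/checkout-2"],
    "pod/checkout-1": ["node/a"],
    "pod/checkout-2": ["node/b"],
    "service/cart": ["pod/cart-1"],
    "pod/cart-1": ["node/a"],
}


def reachable(graph, start):
    """Breadth-first walk returning every component reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def invert(graph):
    """Flip the edges so each component maps to the components that depend on it."""
    inverted = {}
    for component, dependencies in graph.items():
        for dependency in dependencies:
            inverted.setdefault(dependency, []).append(component)
    return inverted


def possible_causes(component):
    """Everything the component depends on, directly or indirectly."""
    return reachable(DEPENDS_ON, component)


def affected_by(component):
    """Everything that depends on the component, directly or indirectly."""
    return reachable(invert(DEPENDS_ON), component)


if __name__ == "__main__":
    print(sorted(affected_by("node/a")))
    print(sorted(possible_causes("service/checkout")))
```

In this toy graph, a problem on node/a surfaces both Services as affected, which is the kind of dependency view a flat list of logs cannot give you.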

This need for visual topology is well recognized. There are tons of topology and APM tools that can provide you with a real-time view of your applications. However, these tools often prove to be expensive. Their tedious roll-out strategies can lead to the IT equivalent of setting up flood defenses in the middle of a hurricane. Perhaps worst of all, you can never monitor everything you need to.

Many teams rely on AIOps tools to find the source of a failure. These tools often need carefully tuned algorithms before a deployment delivers value, even in the short term. But from a monitoring perspective, the result is a black box that doesn’t provide all the information you need. It’s like a student handing in a math test that shows the answer but not the work. You get a projected overview of what may have happened in your environment based on a generic algorithm, but the tools have no definitive way to show what did happen. You still need to validate the problem: triage the alerts going off, see what changes occurred, and look through logs to confirm that the projected cause is indeed correct.

The future of cloud-native containerized monitoring requires the correlation of Telemetry + Topology + Time

Let’s consider the concept of a correlated view of Telemetry + Topology + Time. To get full visibility into your stack and zoom in on the cause of performance issues, you need to know what your whole environment looked like over time. Not just a log of events that have happened over time, but an ability to drill down into any moment in time and visualize the specific topology, events impacting the topology and the associated Golden Signals at that exact point.

What containers were running, using what resources, and how much Saturation was there at that point in time? Having this data plotted into a dedicated system will help us do anything from identifying the cause of an issue to figuring out why the AWS bill was so high last month.
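To make the idea of drilling into a moment in time concrete, here is a small Python sketch that keeps timestamped snapshots and returns the one in effect at any given moment. Snapshot, TopologyHistory, and the saturation figures are assumptions invented for this illustration, not the data model of any particular product.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Snapshot:
    taken_at: datetime
    containers: dict  # container name -> CPU saturation between 0.0 and 1.0


@dataclass
class TopologyHistory:
    snapshots: list = field(default_factory=list)

    def record(self, snapshot):
        """Insert the snapshot so the list stays ordered by time."""
        times = [s.taken_at for s in self.snapshots]
        self.snapshots.insert(bisect_right(times, snapshot.taken_at), snapshot)

    def at(self, moment):
        """Return the latest snapshot taken at or before moment, or None."""
        times = [s.taken_at for s in self.snapshots]
        index = bisect_right(times, moment)
        return self.snapshots[index - 1] if index else None


if __name__ == "__main__":
    history = TopologyHistory()
    history.record(Snapshot(datetime(2021, 9, 1, 12, 0), {"cart": 0.35}))
    history.record(Snapshot(datetime(2021, 9, 1, 12, 5), {"cart": 0.92, "search": 0.40}))

    # What was running, and how saturated was it, at 12:03?
    print(history.at(datetime(2021, 9, 1, 12, 3)).containers)
```

The lookup answers "what did the environment look like then?" rather than "what happened since?", which is the distinction drawn above between a log of events and a view of any point in time.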

Kubernetes environments change too frequently for a traditional CMDB to serve as the source of truth for the state of the stack. To effectively manage highly complex and innovative environments, we need fundamental technology that can keep track of data across multiple sources and visually show the combined topology of what’s going on, both in real time and over time.

Monitoring tools will need to invest in their capabilities to correlate all data at all points in time – topology + telemetry + traces + events + logs, etc. Tomorrow’s Kubernetes tool needs to take regular, auditable snapshots of your environment and provide a time-series view of topology that gives context to all time-series data collected across all sources. Without this level of depth and correlation at scale, we will be forever challenged to maintain reliability in environments characterized by constant change.
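One way to picture this kind of correlation is a filter that keeps only the events that are close to an incident both in time and in topology. The Python sketch below assumes a hypothetical list of change events from different sources and a set of components already known to be related to the failing service; the field names and the 15-minute lookback are arbitrary choices for the example.

```python
from datetime import datetime, timedelta

# Hypothetical change events gathered from several sources.
EVENTS = [
    {"at": datetime(2021, 9, 1, 11, 40), "component": "pod/cart-1",
     "source": "deploy", "what": "image updated"},
    {"at": datetime(2021, 9, 1, 11, 58), "component": "node/a",
     "source": "cloud", "what": "disk pressure"},
    {"at": datetime(2021, 9, 1, 11, 59), "component": "pod/search-1",
     "source": "deploy", "what": "restart"},
]


def correlate(events, incident_at, related, lookback=timedelta(minutes=15)):
    """Return events on related components inside the lookback window, newest first."""
    window_start = incident_at - lookback
    hits = [
        event for event in events
        if window_start <= event["at"] <= incident_at
        and event["component"] in related
    ]
    return sorted(hits, key=lambda event: event["at"], reverse=True)


if __name__ == "__main__":
    incident = datetime(2021, 9, 1, 12, 0)
    related = {"pod/cart-1", "node/a"}  # assumed to come from the topology
    for event in correlate(EVENTS, incident, related):
        print(event["at"], event["component"], event["what"])
```

The restart of pod/search-1 happened one minute before the incident but is dropped because it is not topologically related. Time alone is not enough context, and neither is topology alone.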

Author: Anthony Evans Solutions Engineer, StackState

Bio: Anthony Evans is a lifelong technologist who has spent the last 15 years helping companies advance their cloud capabilities while maintaining reliability. Currently a Solutions Engineer with StackState, he has worked extensively in the SaaS, AI and Service Management landscapes, supporting customers on their cloud-native containerized application journeys. Anthony’s background at companies such as ServiceNow, IPSoft, Espressive, and AISERA allows him to understand first-hand the need for innovation, while at the same time ensuring customers operate as smoothly as possible.

To hear more about cloud native topics, join the Cloud Native Computing Foundation and cloud native community at KubeCon+CloudNativeCon North America 2021 – October 11-15, 2021
