The 5 Pillars of Distributed Tracing

Author: Ran Ribenzaft

No one likes problems with applications or services. Not the customer. Not the business. And certainly not the developer.

Microservices today are replacing cumbersome, fragile, and high-maintenance monolithic applications. Each microservice is a small, self-contained application designed for a specific task or business process.

Distributed microservice architectures have become business game-changers that can provide scalability, faster delivery, and efficient development for applications and services.

But microservices have introduced new visibility challenges, along with newer and perhaps more intractable problems that can’t be fixed using traditional monitoring methods.

Microservices are not restricted to traditional HTTP communication methods. In fact, developers often implement asynchronous design patterns that remove the coupling between services and teams and improve the customer experience.

There is a growing number of messages between services, so monitoring is more complex.

Monitoring metrics are also different. For example, CPU usage is not relevant when using managed services.

Finally, you might not have access to the host, so persistent monitoring agents cannot be used.

Microservice-based apps and services are simply different from monolithic apps and require a different kind of monitoring and troubleshooting.

Request Tracing and Microservices

When monolithic software breaks due to its fragility, there are a number of proven methods for troubleshooting and debugging it.

One of the key methods for finding a problem is tracing. A trace follows the course of a request or system event from its source to its ultimate destination. The trace is stored in an application log, displayed in a console/terminal window, and then analyzed and inspected via development tools such as Microsoft’s Visual Studio.

When you have a problem with a microservice, you look for an internal defect or external factors, such as unpredictable scaling behavior. To discover the problem, you have to find where it runs and on which version of the service the problem occurred. Often, the environment is complex with multiple versions of a service running across different servers and locations.

As a result, existing tracing techniques are inadequate for debugging microservices and similar technologies. If you don’t know where your app runs, which services it’s using, and the path taken by a request, you simply can’t trace the event.

The Limits of Logs and Log Aggregation for Microservices

Method 1: Monitoring Individual Services and Applications

When encountering a runtime error with a microservice, developers will first consult the stack trace. A service written in a compiled language, such as Java, generates the stack trace at runtime; for interpreted languages, such as Python, you’ll see these errors in the console. The stack trace indicates where and when the problem occurred. If there is some form of application log, developers will check this log to see three things: the requests the application received, when the requests occurred, and when the application responded to the requests.
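As a minimal sketch of the kind of application log described above, the following Python snippet records when a request was received, when the service responded, and the stack trace if the handler fails. The service name, request IDs, and the handler logic are hypothetical and exist only for illustration.

```python
import io
import logging
import time


def build_logger(stream):
    """Create a logger that writes timestamped records to the given stream."""
    logger = logging.getLogger("orders-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(logger, request_id, payload):
    """Log receipt, then either the response time or the stack trace."""
    logger.info("received request=%s", request_id)
    started = time.perf_counter()
    try:
        result = 100 / payload["quantity"]  # stand-in for real business logic
    except Exception:
        # logger.exception appends the full stack trace to the log record.
        logger.exception("failed request=%s", request_id)
        return None
    elapsed_ms = (time.perf_counter() - started) * 1000
    logger.info("responded request=%s elapsed_ms=%.2f", request_id, elapsed_ms)
    return result


log_output = io.StringIO()
logger = build_logger(log_output)
ok = handle_request(logger, "r-1", {"quantity": 4})
failed = handle_request(logger, "r-2", {"quantity": 0})
print(log_output.getvalue())
```

The second call divides by zero, so its log entry carries the stack trace, while the first call produces a matching pair of "received" and "responded" records.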

System logs of the host device complete the view of an issue. These logs provide important background information but do not identify the issue. For example, higher-than-expected use of system resources may point to a service with a memory leak, and a process trying to access restricted filesystem locations may indicate a malware infection.

There are a number of advantages to relying on a service’s application and system log data.

Log data is widely available. Most software probably has some form of built-in logging.
Adding a logging mechanism is relatively simple if not built-in.
There are many tools available to read, process, and easily understand captured data.

The disadvantage here: we see only what is happening within a single service.

In contrast, a single request in a distributed system is handled by multiple services. By monitoring an individual service, we may have fixed the one problem, but we have ignored the impact of the problem on the distributed system and services.

Method 2: Log Aggregation

Log aggregation helps you locate and diagnose problems by collecting logging data from as many services as possible and combining the results. Agents on the hosts collect the logs and then stream this data to a server for processing. A number of existing open-source and third-party tools and services are available, providing a powerful search to match patterns in the recorded data. Many of these tools are based on the Elasticsearch, Logstash, and Kibana (ELK) stack.
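The idea can be sketched in a few lines of Python. This is an illustration only, not how the ELK stack is implemented: the record layout, the service names, and the log messages are all assumptions, and an in-memory merge stands in for the agents and the central server.

```python
import heapq
import re
from typing import Iterable, Iterator, List, NamedTuple


class LogRecord(NamedTuple):
    timestamp: float
    service: str
    message: str


def aggregate(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Merge per-service streams (each already time-ordered) into one."""
    return heapq.merge(*streams, key=lambda record: record.timestamp)


def search(records: Iterable[LogRecord], pattern: str) -> List[LogRecord]:
    """Return the records whose message matches the regular expression."""
    matcher = re.compile(pattern)
    return [record for record in records if matcher.search(record.message)]


# Hypothetical logs collected from two services.
checkout_logs = [
    LogRecord(1.0, "checkout", "order 17 accepted"),
    LogRecord(4.0, "checkout", "order 18 accepted"),
]
payment_logs = [
    LogRecord(2.5, "payment", "timeout charging card"),
    LogRecord(3.0, "payment", "retry succeeded"),
]

merged = list(aggregate(checkout_logs, payment_logs))
hits = search(merged, r"timeout|error")
for record in hits:
    print(record)
```

The search finds the payment timeout, but nothing in the merged stream says which order or which incoming request it belonged to.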

Unfortunately, log aggregation does have some drawbacks.

Log aggregation only captures data for individual services and lacks the relevant contextual data to demonstrate the wider impact of the problem.
Log files may not be preserved because of storage restrictions, so the logging data needed to spot long-term trends is unavailable.
A cloud-based logging storage solution can be expensive over the long term.

Ultimately, there is little difference between collecting data from a single service and collecting it from multiple services. Why? Because what we actually want to understand is how the problems of one or more services affect a request as it moves through a system.

Fixing the Problem With Distributed Tracing

With distributed tracing, you can monitor microservice-based apps/architectures, as well as identify and locate failures and improve performance. Distributed tracing follows the progress of a single request from its point of origin to its final destination. The requests can be synchronous (actual requests) or asynchronous (through message queues, streams or databases). As the request moves across multiple systems and domains, the request generates traces that record its interactions with the processes, APIs, and services.

Each trace is given its own unique ID and is made up of spans (segments), each of which describes a given activity performed by a host system on the request. Every span indicates a single step within the request’s path and has a name, unique ID, and timestamp. A span can also carry additional metadata.
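The relationship between a trace and its spans can be sketched as a small data model. This is an illustrative sketch, not the schema of any particular tracing engine: the class and field names are assumptions, and the parent_id field is an addition that lets a span point to the step that triggered it.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One step in the request's path."""
    trace_id: str
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None  # assumption: links a span to its caller
    start_time: float = field(default_factory=time.time)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Trace:
    """All spans recorded for a single request, sharing one trace ID."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def start_span(self, name: str, parent: Optional[Span] = None,
                   **metadata: str) -> Span:
        span = Span(
            trace_id=self.trace_id,
            name=name,
            parent_id=parent.span_id if parent else None,
            metadata=dict(metadata),
        )
        self.spans.append(span)
        return span


# Hypothetical request that passes through two services.
trace = Trace()
root = trace.start_span("api-gateway", route="/checkout")
child = trace.start_span("payment-service", parent=root)
for span in trace.spans:
    print(span.trace_id, span.span_id, span.parent_id, span.name)
```

Every span carries the same trace ID, which is what allows a tracing backend to reassemble the request's path from records emitted by different services.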

There are a number of ways to implement distributed tracing.

You can do it yourself.
Use distributed tracing engines: some are open-source and free. Popular tools include Jaeger and Zipkin. The engine collects the request, trace, and segment data and helps present, analyze, and visualize this data. But these tools have limited features and present the traces in a simple timeline with no payload visibility or alerts.
Use frameworks and libraries to build your own solutions or extend your existing monitoring and tracing tools. OpenTracing and OpenCensus are popular choices.
Use an automated, cloud-native solution.

Method 1: Do It Yourself (DIY)

The DIY tracing solution can track all payloads, uniquely identify messages and requests, and track requests over your system.
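A minimal sketch of the core DIY mechanism, assuming a trace ID carried in message headers: each service reuses the ID it received when it publishes follow-up messages, so every hop of a request can be tied together. The header name is hypothetical, and in-memory queues stand in for a real message broker.

```python
import queue
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name


def publish(broker, body, headers=None):
    """Send a message, reusing the caller's trace ID or creating a new one."""
    headers = dict(headers or {})
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    broker.put({"headers": headers, "body": body})
    return headers[TRACE_HEADER]


def consume(broker, recorded):
    """Receive a message and record its trace ID alongside the payload."""
    message = broker.get_nowait()
    recorded.append((message["headers"][TRACE_HEADER], message["body"]))
    return message


orders = queue.Queue()    # stand-in for the first service's queue
shipping = queue.Queue()  # stand-in for a downstream service's queue
recorded = []

# The first service starts the request and creates the trace ID.
trace_id = publish(orders, "order 17 created")

# The second service consumes it and forwards the same headers downstream.
incoming = consume(orders, recorded)
publish(shipping, "ship order 17", headers=incoming["headers"])
consume(shipping, recorded)

for entry in recorded:
    print(entry)
```

Both recorded entries carry the same trace ID, so the payloads seen by the two services can be joined into one view of the request.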

The DIY advantages are:

You can repurpose your existing tools and write code that integrates your tracing solution.
You can use your existing infrastructure, knowledge, and skillsets.
You can customize the system to fit your current and future needs.

But most organizations simply don’t have the time, money, or experience to invest in a DIY project.

Method 2: Use Open Frameworks

In an “open” approach, you still write code, but you use an existing open, distributed tracing framework. OpenTracing and OpenCensus are two examples of popular open frameworks.

There are a number of advantages to these popular open frameworks.

Free and open: These tools are free and open-source and were developed by major tech companies, such as Google, Twitter, and Uber. You can integrate your code and management apps and configure them manually. You also can experiment with the open frameworks without the threat of vendor lock-in.
Language support: Support for most common high-level languages enables you to build your tracing and logging tools and integrate them into your existing applications and development environment.
Compatible management tools: Since open frameworks do generate large amounts of data, you can use tools such as Jaeger and Zipkin for tracing management and analytics.
High level of flexibility: You can build a generic system that you can customize to meet your needs. You can use each framework and management system on its own, combine them, or integrate with third-party solutions. Your solution can also include multiple types of services, design patterns, and communication protocols.

This open framework approach, however, relies on you to do the necessary coding and integration work. When you run into an issue with an open framework approach, you may have to solve it on your own or consult the community and its documentation.

Also, the more resources and developers you have available for this type of project, the better. Sufficient training can be an issue. The time expended on developing and maintaining manual tracing can be up to 30% of the total development time.

The following sample Python code shows how you could integrate an open framework with your application code.

import logging
import sys

import pika
from opencensus.ext.zipkin.trace_exporter import ZipkinExporter
from opencensus.trace.samplers import always_on
from opencensus.trace.tracer import Tracer

# Export every span to a local Zipkin collector. These import paths follow the
# OpenCensus release this example was written against; newer releases may
# expose AlwaysOnSampler directly from opencensus.trace.samplers.
ze = ZipkinExporter(service_name="dr-test", host_name='localhost', port=9411,
                    endpoint='/api/v2/spans')
tracer = Tracer(exporter=ze, sampler=always_on.AlwaysOnSampler())


def main():
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger('send_message')

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='task_queue', durable=True)

    # Everything inside the "with" block is recorded as one span named "main".
    with tracer.span(name="main"):
        message = ' '.join(sys.argv[1:])
        channel.basic_publish(
            exchange='',
            routing_key='task_queue',
            body=message,
            # delivery_mode=2 marks the message as persistent.
            properties=pika.BasicProperties(delivery_mode=2))
        logger.info("Sent " + message)

    connection.close()


if __name__ == '__main__':
    main()

Method 3: Automated, Distributed and Cloud-based Tracing

Another alternative is to use a cloud-based tracing solution. It can automatically monitor any requests generated by your software and track them across multiple systems. Thus, at different stages of the request’s path, it can send notifications to alert you to problems or indicate the request’s progress.
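One way such automation can work is by wrapping existing functions at runtime so that call sites stay unchanged. The sketch below illustrates that general technique only; it is not a description of how any specific product is implemented, and the client object, function name, and event fields are hypothetical.

```python
import functools
import time
import types


def auto_instrument(module, function_name, sink):
    """Replace module.function_name with a wrapper that records each call."""
    original = getattr(module, function_name)

    @functools.wraps(original)
    def traced(*args, **kwargs):
        started = time.perf_counter()
        error = None
        try:
            return original(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Record the call whether it succeeded or failed.
            sink.append({
                "name": function_name,
                "duration_s": time.perf_counter() - started,
                "error": error,
            })

    setattr(module, function_name, traced)


# Hypothetical client library that the application already calls.
client = types.SimpleNamespace(
    fetch=lambda order_id: {"order_id": order_id, "status": "shipped"})

events = []
auto_instrument(client, "fetch", events)

# The application code is unchanged; the call is now recorded as a side effect.
result = client.fetch(17)
print(result, events)
```

A real solution would ship the recorded events to a backend and attach trace IDs to them; here they are only appended to a list.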

The results that an automated solution produces can be more reliable and consistent than those of a DIY approach. Like the services you could build yourself, these solutions are cross-platform and support multiple development stacks as well as high-level languages. Any data recorded by the system can also be viewed, analyzed, and presented in a number of visual formats and charts. An agentless solution can help you bypass most or all instrumentation and set up monitoring and troubleshooting quickly. There is little to no maintenance, no heavy lifting or training required, and very often no code changes.

Benefits of Distributed Tracing

Automatically monitor and alert
Reliable and consistent results
Cross-platform, multiple development stacks & languages
Visualization of data for analysis

Finding the Best Solution

Monitoring applications and services has always been challenging, but trying to monitor applications and services in distributed environments is harder.

Individual application and system logging, log aggregators, and legacy monitoring solutions are unsuitable for microservices running in distributed environments. Of the three possible approaches to implementing and deploying distributed tracing, an automated, distributed, agentless, and cloud-native tracing solution for microservices is easier and more effective than the alternatives. The pillars for tracing microservices should therefore include:

Distributed tracing and logging that enables you to do monitoring and troubleshooting quickly and efficiently.
An agentless approach for any kind of service, whether it’s running in a container, a VM, or FaaS.
A fully automated experience with distributed tracing through every service in a matter of minutes with no coding and very little maintenance.
Visualization of the traces and architecture maps to provide confidence when developing new features.
The ability to search across the tracing payloads and logs to pinpoint and fix complex issues in seconds.

