Debugging Distributed Systems: How to Overcome Common Challenges

19th Feb 2020

Below we outline common approaches to distributed tracing, the challenges these methods pose when debugging distributed systems, and how OverOps can help deliver greater insights when troubleshooting across microservices.

The accelerated adoption of microservices and increasingly distributed systems brings the promise of greater speed, scalability and flexibility. But this shift to more modular architecture is not without its fair share of challenges – especially when it comes to troubleshooting.

Distributed tracing has emerged as a common way to troubleshoot distributed applications, particularly those built on containerized architecture. It helps pinpoint where failures occur across multiple services and what is causing poor performance.

While the concept is straightforward, distributed tracing is not an easy problem to solve. Below we break down some common challenges teams encounter with distributed tracing, as well as the most effective methods. 

The Top 3 Hurdles to Effective Distributed Tracing 

  1. A microservice could be invoked multiple times during a business transaction.
  2. A microservice could be called in any order. It’s extremely hard to enforce a predefined set of conditions before a microservice is invoked.
  3. As data is modified as part of the business flow, tracking the data across the various microservices gets tricky. 

Implementation challenges also arise. Different microservices usually own distinct business functions, so correlating between them can mean working across multiple languages, protocols and vocabularies. A “statusID” or “statusDate” in one system could mean something completely different in another system.

Troubleshooting Distributed Transactions

The simplest way to get visibility into a distributed transaction is to use what is often referred to as “baggage”: a unique identifier assigned to each business transaction so that it can be tracked over the course of its lifetime. This could be a simple immutable ID such as “TransactionID”, “VisitorID” or “CustomerID” that the microservices already share. If no such immutable ID is available, an alternative is to create a unique “Baggage ID” that is passed across these services.
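As a rough illustration of that choice, the sketch below reuses an existing business ID when one is present and mints a new one otherwise. The field names come from the examples above; the `BaggageID` key and both function names are assumptions made for this example, not part of any product API.

```python
import uuid

# Candidate immutable IDs, in order of preference (names taken from the text above).
CANDIDATE_ID_FIELDS = ("TransactionID", "VisitorID", "CustomerID")


def resolve_baggage_id(request: dict) -> str:
    """Return an existing immutable ID if the request carries one, else mint a new one."""
    for field in CANDIDATE_ID_FIELDS:
        value = request.get(field)
        if value:
            return str(value)
    # No usable business ID: create a fresh, effectively unique Baggage ID.
    return uuid.uuid4().hex


def attach_baggage(request: dict) -> dict:
    """Return a copy of the request tagged with a BaggageID (hypothetical key name).

    An ID that is already attached is never overwritten, so downstream
    services keep the value chosen by the first service in the chain.
    """
    tagged = dict(request)
    tagged.setdefault("BaggageID", resolve_baggage_id(request))
    return tagged
```

The first service to see the transaction would call `attach_baggage`, and every later service would forward the `BaggageID` it received unchanged.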

To troubleshoot effectively, developers need to follow a best practice of adding this “Baggage ID” to every exception, log error and warning they write into the system.
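One way to avoid threading the ID through every log call by hand is to stamp it onto each record automatically. The sketch below does this with Python's standard `logging` and `contextvars` modules. The names `baggage_id_var`, `BaggageFilter`, `baggage` and `build_logger` are hypothetical, and the log format is only one possible layout.

```python
import contextvars
import logging
from contextlib import contextmanager

# Assumed log layout: the Baggage ID appears on every line.
LOG_FORMAT = "%(levelname)s baggage=%(baggage_id)s %(message)s"

# Holds the Baggage ID of the transaction currently being handled.
baggage_id_var = contextvars.ContextVar("baggage_id", default="-")


class BaggageFilter(logging.Filter):
    """Copy the current Baggage ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.baggage_id = baggage_id_var.get()
        return True  # annotate only, never drop the record


@contextmanager
def baggage(baggage_id: str):
    """Set the Baggage ID for the duration of one unit of work."""
    token = baggage_id_var.set(baggage_id)
    try:
        yield
    finally:
        baggage_id_var.reset(token)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger whose handler adds the Baggage ID to each line."""
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(BaggageFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)
    return logger
```

With this in place, `with baggage("tx-42"): logger.warning("inventory low")` emits `WARNING baggage=tx-42 inventory low`, and `logger.exception(...)` inside the same block carries the ID alongside the traceback.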

Tracking this “Baggage ID” across the various microservices would enable applications to “follow the baggage” across distributed applications.
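A minimal sketch of “following the baggage” might look like the following. It assumes that log entries from all services have already been collected in one place and that their timestamps are comparable across hosts. The `LogEntry` shape and the function names are illustrative only.

```python
from typing import Iterable, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: float  # seconds since epoch; assumes clocks are reasonably in sync
    service: str
    baggage_id: str
    message: str


def follow_baggage(entries: Iterable[LogEntry], baggage_id: str) -> List[LogEntry]:
    """Return every entry for one transaction, ordered by time."""
    matching = (e for e in entries if e.baggage_id == baggage_id)
    return sorted(matching, key=lambda e: e.timestamp)


def service_path(entries: Iterable[LogEntry], baggage_id: str) -> List[str]:
    """Return the sequence of services the transaction passed through.

    A service may appear more than once and in any position, which is
    exactly why a fixed call order cannot be assumed.
    """
    return [e.service for e in follow_baggage(entries, baggage_id)]
```

Because the path is rebuilt from the logs rather than from a predefined call graph, it still works when a microservice is invoked several times or in an unexpected order.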

*Note that the above design practice involves “code changes” to identify and log (if not already present) the unique identifier across the distributed transaction.

Distributed Tracing Today: An Introduction to Open Tracing Frameworks

There are two main ways that teams approach distributed tracing: open tracing frameworks and APM vendor tools.

Let’s start with OpenTracing. While it is not a standard, it comprises an API specification along with the frameworks and libraries that implement that specification. OpenTracing allows developers to add instrumentation to their application code using APIs that do not lock them into any one particular product or vendor. More details are available from the OpenTracing project.

There are also open source tools, such as Jaeger, that provide end-to-end distributed tracing and the ability to monitor and troubleshoot transactions in complex distributed systems.

Multiple APM vendors have also taken a crack at solving this problem. While some provide OpenTracing API support, others use W3C standards to propagate trace context through the stack. Some vendors add a “traceID” to the HTTP headers of each incoming request and use it to track distributed calls. While every approach has its advantages and disadvantages, this post will not explore APM vendor implementations.
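To make the header-based approach concrete, here is a small sketch built on the `traceparent` header format from the W3C Trace Context specification (version `00`: a 32-hex-digit trace ID, a 16-hex-digit parent ID and 2 hex digits of flags). It is not any particular vendor's implementation. The function names are hypothetical, and it assumes header names have already been lower-cased.

```python
import re
import secrets
from typing import Optional, Tuple

# Version 00 only: 00-<trace-id>-<parent-id>-<trace-flags>, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(value: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Build headers for a downstream call.

    Keeps the caller's trace ID so every hop shares it, and mints a new
    span ID for this hop. Starts a fresh trace if no valid header arrived.
    """
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        return {"traceparent": new_traceparent()}
    trace_id, _parent_id, flags = parsed
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

The trace ID plays the same role as the “Baggage ID” above: it stays constant for the whole transaction, while the span ID changes at every hop.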

How OverOps Enhances Traditional Troubleshooting Techniques

Where does OverOps fit into this picture? Before I get into this, it’s important to highlight that OverOps is not a distributed tracing tool. Having said this, using OverOps in conjunction with the above “Baggage ID” design pattern can provide much greater insights into a distributed transaction.

The first step is to enable OverOps’ Tiny Links (also known as ARC links). Doing this associates every application anomaly (Exception, Throwable, Log Error, Log Warning and HttpError) with a Tiny Link. Because the “Baggage ID” is written into each of those anomalies, it in turn becomes associated with the corresponding Tiny Link. This unique association provides a number of advantages.

  1. Ability to find out “What, When and Why” when a distributed transaction breaks. By associating a “Baggage ID” with a Tiny Link, OverOps provides you with details on the exact microservice, version, line of source code and variables associated with the problem. In addition, basic error metrics such as error rate are also captured.
  2. Visibility across individual microservices of the business transaction, even if it doesn’t fail. This can be achieved by using log.warning for visibility purposes. OverOps takes a snapshot of all warnings, which gives additional troubleshooting context even when the business transaction doesn’t fail.

For more information on OverOps Tiny Links, please refer to the OverOps documentation.

To check out OverOps’ continuous reliability solution for yourself, sign up for a free trial or request a personalized demo.

Karthik Lalithraj is a Principal Solutions Architect at OverOps with focus on Code Quality and Application Reliability for the IT Services industry. With over 2 decades of software experience in a variety of roles and responsibilities, Karthik takes a holistic view of software architecture with special emphasis on helping enterprise IT organizations improve their service availability, application performance and scale. Karthik has successfully helped recruit and build enterprise teams, architected, designed and implemented business and technical solutions with numerous customers in various business verticals.
