Distributed traces

Trace concepts

Properties

  • Definitions:

    • Traces—or more precisely, “distributed traces”—are samples of causal chains of events (or transactions) between different components in a microservices ecosystem. And like events and logs, traces are discrete and irregular in occurrence.

  • Properties:

    • Traces that are stitched together form special events called “spans”; spans help you track a causal chain through a microservices ecosystem for a single transaction. To accomplish this, each service passes correlation identifiers, known as “trace context,” to each other; this trace context is used to add attributes on the span.

Usecase

  • Trace data is needed when you care about the relationships between services/entities. If you only had raw events for each service in isolation, you’d have no way of reconstructing a single chain between services for a particular transaction.

  • Additionally, applications often call multiple other applications depending on the task they’re trying to accomplish; they also often process data in parallel, so the call-chain can be inconsistent and timing can be unreliable for correlation. The only way to ensure a consistent call-chain is to pass trace context between each service to uniquely identify a single transaction through the entire chain.

  • Optimize the calling chain. For example, if a service calls the other one repeatedly, could these requests being batched? Or could such requests be parallelized?

  • Locate the bottleneck service.

  • Optimize the network calls. e.g. Identify whether there are cross region calls

Data model

TraceID

  • TraceId could be used to concatenate the call logs of a request on each server.

Generation rule

  • Sample generation rule:

    • The TraceId is typically generated by the first server that receives the request. The generation rule is: server IP + generated time + incremental sequence + current process ID, such as:

  • Example: 0ad1348f1403169275002100356696

    • The first 8 digits 0ad1348f is the IP of the machine that generates TraceId. This is a hexadecimal number, in which every two digits represents a part of IP. Based on the number, we can get a common IP address like 10.209.52.143 by converting every two digits into a decimal number. According to this rule, you can also figure out the first server that the request goes through.

    • The next 13 digits 1403169275002 is the time to generate the TraceId.

    • The next 4 digits 1003 is an auto-incrementing sequence that increases from 1000 to 9000. After reaching 9000, it returns to 1000 and then restarts to increase.

    • The last 5 digits 56696 is the current process ID. Its role in tracerId is to prevent the TraceId conflicts caused by multiple processes in a single machine.

Sample rate

   Client Tracer                                                  Server Tracer     
┌───────────────────────┐                                       ┌───────────────────────┐
│                       │                                       │                       │
│   TraceContext        │          Http Request Headers         │   TraceContext        │
│ ┌───────────────────┐ │         ┌───────────────────┐         │ ┌───────────────────┐ │
│ │ TraceId           │ │         │ X-B3-TraceId      │         │ │ TraceId           │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ ParentSpanId      │ │ Inject  │ X-B3-ParentSpanId │ Extract │ │ ParentSpanId      │ │
│ │                   ├─┼────────>│                   ├─────────┼>│                   │ │
│ │ SpanId            │ │         │ X-B3-SpanId       │         │ │ SpanId            │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ Sampling decision │ │         │ X-B3-Sampled      │         │ │ Sampling decision │ │
│ └───────────────────┘ │         └───────────────────┘         │ └───────────────────┘ │
│                       │                                       │                       │
└───────────────────────┘                                       └───────────────────────┘

SpanID

  • Span ID could be used to determine the order of execution for all calls happened within the same Trace ID.

// Temporal relationships between Spans in a single Trace

––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

 [Span A···················································]
   [Span B··············································]
      [Span D··········································]
    [Span C········································]
         [Span E·······]        [Span F··] [Span G··] [Span H··]


// Causal relationships between Spans in a single Trace


        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a `ChildOf` Span A)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F] >>> [Span G] >>> [Span H]



                         (Span G `FollowsFrom` Span F)

Parent spanId

Dot spanId

  • This is another way of defining parent span Id.

  • Cons: When a trace has too many calling layers, the dot spanId will carry too much redundant information.

Annotation

  • Basic description info related to the trace

Context propogation

  • A context will often have information identifying the current span and trace (e.g. SpanId / TraceId), and can contain arbitrary correlations as key-value pairs.

  • Propagation is the means by which context is bundled and transferred across.

  • The ability to correlate events across service boundaries is one of the principle concepts behind distributed tracing. To find these correlations, components in a distributed system need to be able to collect, store, and transfer metadata referred to as context.

Across threads

Across Restful style service APIs

  • There are several protocols for context propagation that OpenTelemetry recognizes.

    • W3C Trace-Context HTTP Propagator

    • W3C Correlation-Context HTTP Propagator

    • B3 Zipkin HTTP Propagator

Across components such as message queues / cache / DB

  1. Add the context variables inside message

    • Cons: temper with message

  2. Change message queue protocol

    • Cons: challenging

OpenTracing API standards**

Architecture

Data collection

Asynchronous processing with bounded buffer queue

  • No matter what approach the data collector adopts, the threads for sending out telemetry data must be separated from business threads. Call it using a background threads pool.

  • There should be a queue between business threads and background threads. And this queue should have bounded size to avoid out of memory issue.

┌─────────────────────────────────────────────────────────────────────────────────┐                                            
│                                   Application                                   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│   ┌───────────────────┐       ┌───────────────┐       ┌─────────────────────┐   │                                            
│   │                   │       │               │       │                     │   │                                            
│   │                   │       │               │       │                     │   │     ┌────────────┐      ┌─────────────────┐
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │  Business logic   │       │ Bounded size  │       │                     │   │     │            │      │Log/Trace/Metrics│
│   │      threads      │──────▶│queue to avoid │──────▶│ Background threads  │   │────▶│Kafka / UDP │─────▶│    Processor    │
│   │                   │       │ Out of Memory │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     └────────────┘      └─────────────────┘
│   │                   │       │               │       │                     │   │                                            
│   └───────────────────┘       └───────────────┘       └─────────────────────┘   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│                                                                                 │                                            
└─────────────────────────────────────────────────────────────────────────────────┘

Approaches

Manual tracing

  • Manually add tracing logs

AOP

Bytecode Instrumentation

Append to log files

  • Appender is responsible for outputing formatted logs to destinations such as disk files, console, etc. Then trace files could be processed in the similar way as log files.

    • When multiple threads use the same appender, there is a chance for resource contention. The append operation needs to be asynchronous. And to fit with asynchornous operation, there must be a buffer queue. Please

Data storage

Requirement analysis

  • No fixed data model but calling chain has a tree-structure.

  • Large amounts of data, would better be compressed.

    • Sample size figures: meituan 100TB per day

Column-family data storage

Data model for a normal trace

  • Use TraceID as rowKey

  • Has two columns

    • Basic info column: Basic info about trace

    • Calling info column: (Each remote service call has four phases)

      • P1: Client send

      • P2: Server receive

      • P3: Server send

      • P4: Client receive

  • Using HBase as an example for an ecommerce website

TraceId00010002

Basic Info Column

Type: buy

Type: refund

Basic Info Column

Status: finished

Status: processing

Calling Info Column

SpanId 1 with P1 calling info

SpanId 1 with P1 calling info

Calling Info Column

SpanId 1 with P2 calling info

SpanId 1 with P2 calling info

Calling Info Column

SpanId 1 with P3 calling info

SpanId 1 with P3 calling info

Calling Info Column

SpanId 1 with P4 calling info

SpanId 1 with P4 calling info

Calling Info Column

SpanId 2 with P1 calling info

SpanId 2 with P1 calling info

Calling Info Column

SpanId 2 with P2 calling info

empty to be filled when finished

Calling Info Column

SpanId 2 with P3 calling info

... ...

Data model for a buiness trace

  • Motivation:

    • The above trace data model covers the case where all spans could be concatenated together with a trace ID. There are cases where multiple trace id needed to be concatenated to form a business chain.

    • For example, in ecommerce system, a customer could create an order, the revise an exsiting order, and later on cancel the order.

  • Also needs a column-family storage from traceID -> json blob and the reverse mapping from system transaction id -> trace ID

TraceIDOrder system transaction IDPayment system transaction IDUser system transaction ID

0001

1

2

3

0002

4

5

6

0003

7

8

9

Distributed file system

  • Each block needs corresponding 48 bits index data. Based on the trace Id, the index position could be decided.

  • The trace Id format could be defined in a way to make locating index and block data easier. For example, ShopWeb-0a010680-375030-2 traceId has four segments. The index file name could be defined as the "ShopWeb" + "0a010680" + "375030". And the block position could be inferred from the 4th segment.

    • ShopWeb: Application name

    • 0a010680: Current machine's IP address

    • 375030: Current time / hour

    • 2: Mono-increasing sequence number in the current unit

Distributed tracing solutions

OpenTracing

Solution inventory

OpenZipkin

// Here’s an example sequence of http tracing where user code calls the resource /foo. This results in a single span, sent asynchronously to Zipkin after user code receives the http response.

// Trace instrumentation report spans asynchronously to prevent delays or failures relating to the tracing system from delaying or breaking user code.

┌─────────────┐ ┌───────────────────────┐  ┌─────────────┐  ┌──────────────────┐
│ User Code   │ │ Trace Instrumentation │  │ Http Client │  │ Zipkin Collector │
└─────────────┘ └───────────────────────┘  └─────────────┘  └──────────────────┘
       │                 │                         │                 │
           ┌─────────┐
       │ ──┤GET /foo ├─▶ │ ────┐                   │                 │
           └─────────┘         │ record tags
       │                 │ ◀───┘                   │                 │
                           ────┐
       │                 │     │ add trace headers │                 │
                           ◀───┘
       │                 │ ────┐                   │                 │
                               │ record timestamp
       │                 │ ◀───┘                   │                 │
                             ┌─────────────────┐
       │                 │ ──┤GET /foo         ├─▶ │                 │
                             │X-B3-TraceId: aa │     ────┐
       │                 │   │X-B3-SpanId: 6b  │   │     │           │
                             └─────────────────┘         │ invoke
       │                 │                         │     │ request   │

       │                 │                         │     │           │
                                 ┌────────┐          ◀───┘
       │                 │ ◀─────┤200 OK  ├─────── │                 │
                           ────┐ └────────┘
       │                 │     │ record duration   │                 │
            ┌────────┐     ◀───┘
       │ ◀──┤200 OK  ├── │                         │                 │
            └────────┘       ┌────────────────────────────────┐
       │                 │ ──┤ asynchronously report span     ├────▶ │
                             │                                │
                             │{                               │
                             │  "traceId": "aa",              │
                             │  "id": "6b",                   │
                             │  "name": "get",                │
                             │  "timestamp": 1483945573944000,│
                             │  "duration": 386000,           │
                             │  "annotations": [              │
                             │--snip--                        │
                             └────────────────────────────────┘

Pinpoint

Compare Pinpoint and OpenZipkin

  • Language support:

    • OpenZipkin has a broad language support, including C#、Go、Java、JavaScript、Ruby、Scala、PHP

    • PinPoint only support Java

  • Integration effort:

    • OpenZipkin's braven trace instrument api needs to be embedded inside business logic

    • Pinpoint uses Bytecode Instrumentation, Not Requiring Code Modifications.

  • Trace granularity:

    • OpenZipkin: Code level

    • Pinpoint: Granular at bytecode level

美团

Ali

Last updated