Distributed traces


Trace concepts

Properties

  • Definitions:

    • Traces—or more precisely, “distributed traces”—are samples of causal chains of events (or transactions) between different components in a microservices ecosystem. And like events and logs, traces are discrete and irregular in occurrence.

  • Properties:

    • Spans are special events that, stitched together, form a trace; spans help you track a causal chain through a microservices ecosystem for a single transaction. To accomplish this, each service passes correlation identifiers, known as "trace context", to the next service; this trace context is used to add attributes on the span.

Usecase

  • Trace data is needed when you care about the relationships between services/entities. If you only had raw events for each service in isolation, you’d have no way of reconstructing a single chain between services for a particular transaction.

  • Additionally, applications often call multiple other applications depending on the task they’re trying to accomplish; they also often process data in parallel, so the call-chain can be inconsistent and timing can be unreliable for correlation. The only way to ensure a consistent call-chain is to pass trace context between each service to uniquely identify a single transaction through the entire chain.

  • Optimize the call chain. For example, if a service calls another service repeatedly, could those requests be batched? Or could they be parallelized?

  • Locate the bottleneck service.

  • Optimize network calls, e.g., identify whether there are cross-region calls.

Data model

TraceID

  • The TraceId is used to stitch together the call logs that a single request leaves on each server.

Generation rule

  • Sample generation rule:

    • The TraceId is typically generated by the first server that receives the request. The generation rule is: server IP + generation timestamp + auto-incrementing sequence + current process ID, for example:

  • Example: 0ad1348f1403169275002100356696 (a small generation/parsing sketch follows this list)

    • The first 8 characters, 0ad1348f, are the IP of the machine that generated the TraceId, encoded in hexadecimal: every two hex digits represent one byte of the IP. Converting each pair to decimal gives the ordinary address 10.209.52.143. Following this rule, you can also figure out the first server that the request went through.

    • The next 13 digits, 1403169275002, are the epoch timestamp in milliseconds at which the TraceId was generated.

    • The next 4 digits, 1003, are an auto-incrementing sequence that cycles from 1000 to 9000; after reaching 9000 it wraps back to 1000 and starts increasing again.

    • The last 5 digits, 56696, are the current process ID. Its role in the TraceId is to prevent conflicts between TraceIds generated by multiple processes on the same machine.
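For illustration, below is a minimal Java sketch of the generation rule just described (hex-encoded IPv4 + epoch milliseconds + cycling sequence + process ID). The class and method names are made up for this example; real implementations such as SOFATracer differ in details.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the TraceId rule above: 8 hex chars of IPv4 + 13-digit
// epoch millis + 4-digit cycling sequence + 5-digit process id.
public class TraceIdGenerator {
    private static final AtomicInteger SEQ = new AtomicInteger(1000);

    public static String nextTraceId() throws UnknownHostException {
        StringBuilder sb = new StringBuilder();
        for (byte b : InetAddress.getLocalHost().getAddress()) {
            sb.append(String.format("%02x", b & 0xff));                 // e.g. 0ad1348f
        }
        sb.append(System.currentTimeMillis());                           // e.g. 1403169275002
        sb.append(SEQ.updateAndGet(v -> v >= 9000 ? 1000 : v + 1));      // 1000..9000, then wraps
        sb.append(String.format("%05d", ProcessHandle.current().pid() % 100000)); // e.g. 56696
        return sb.toString();
    }

    // Recover the dotted IP from the first 8 hex chars: "0ad1348f" -> "10.209.52.143"
    public static String ipFromTraceId(String traceId) {
        StringBuilder ip = new StringBuilder();
        for (int i = 0; i < 8; i += 2) {
            if (i > 0) ip.append('.');
            ip.append(Integer.parseInt(traceId.substring(i, i + 2), 16));
        }
        return ip.toString();
    }
}
```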

Sample rate

  • The sampling decision applies to the trace ID, not the span ID.

  • There are four possible sampling states:

    • Accept: Decide to include.

    • Debug: Within certain testing environments, always sample.

    • Defer: The decision on whether to trace cannot be made here; for example, wait for a certain proxy to make the decision.

    • Deny: Decide to exclude.

  • The most common use of sampling is probabilistic: e.g., accept 0.01% of traces and deny the rest. Debug is the least common use case. A minimal sampler sketch follows.
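As a small illustration of the probabilistic case, here is a hedged Java sketch; the class is illustrative, and real tracers (e.g., Brave) make this decision once at the root span and then propagate it downstream via the X-B3-Sampled header.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: accept a fixed fraction of traces, deny the rest.
// The decision is made once per trace (at the root span) and then propagated
// downstream so that every service agrees on it.
public class ProbabilisticSampler {
    private final double rate;                 // e.g. 0.0001 for 0.01% of traces

    public ProbabilisticSampler(double rate) {
        this.rate = rate;
    }

    public boolean sampled() {
        return ThreadLocalRandom.current().nextDouble() < rate;
    }
}
```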

   Client Tracer                                                  Server Tracer     
┌───────────────────────┐                                       ┌───────────────────────┐
│                       │                                       │                       │
│   TraceContext        │          Http Request Headers         │   TraceContext        │
│ ┌───────────────────┐ │         ┌───────────────────┐         │ ┌───────────────────┐ │
│ │ TraceId           │ │         │ X-B3-TraceId      │         │ │ TraceId           │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ ParentSpanId      │ │ Inject  │ X-B3-ParentSpanId │ Extract │ │ ParentSpanId      │ │
│ │                   ├─┼────────>│                   ├─────────┼>│                   │ │
│ │ SpanId            │ │         │ X-B3-SpanId       │         │ │ SpanId            │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ Sampling decision │ │         │ X-B3-Sampled      │         │ │ Sampling decision │ │
│ └───────────────────┘ │         └───────────────────┘         │ └───────────────────┘ │
│                       │                                       │                       │
└───────────────────────┘                                       └───────────────────────┘

SpanID

  • The SpanId is used to determine the order of execution for all calls that happen within the same trace.

// Temporal relationships between Spans in a single Trace

––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

 [Span A···················································]
   [Span B··············································]
      [Span D··········································]
    [Span C········································]
         [Span E·······]        [Span F··] [Span G··] [Span H··]


// Causal relationships between Spans in a single Trace


        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a `ChildOf` Span A)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F] >>> [Span G] >>> [Span H]
                                       ↑
                                       ↑
                                       ↑
                         (Span G `FollowsFrom` Span F)
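The same ChildOf / FollowsFrom relationships can be expressed in code. The sketch below uses the OpenTracing Java API; it assumes the io.opentracing dependency and a registered tracer (e.g., Jaeger or Brave), otherwise GlobalTracer falls back to a no-op implementation.

```java
import io.opentracing.References;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

// Sketch of the ChildOf / FollowsFrom relationships from the diagram above.
public class SpanRelationships {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        Span a = tracer.buildSpan("spanA").start();                    // the root span
        Span b = tracer.buildSpan("spanB").asChildOf(a).start();      // Span B ChildOf Span A
        Span f = tracer.buildSpan("spanF").asChildOf(a).start();
        Span g = tracer.buildSpan("spanG")                             // Span G FollowsFrom Span F
                .addReference(References.FOLLOWS_FROM, f.context())
                .start();

        g.finish();
        f.finish();
        b.finish();
        a.finish();
    }
}
```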

Parent spanId

  • This is one way of defining the parent relationship, and the more commonly adopted one: each span records the SpanId of its parent explicitly (the ParentSpanId field shown in the B3 headers above).

Dot spanId

  • This is another way of defining the parent relationship: the SpanId itself encodes the call path as dot-separated segments, so a span's ancestry can be read off its own ID (e.g., a span with ID 0.1.2 is a child of span 0.1).

  • Cons: when a trace has too many calling layers, the dot SpanId carries too much redundant information.

Annotation

  • Basic descriptive information related to the trace.

Context propagation

  • A context will often have information identifying the current span and trace (e.g. SpanId / TraceId), and can contain arbitrary correlations as key-value pairs.

  • Propagation is the means by which context is bundled and transferred across service and process boundaries.

  • The ability to correlate events across service boundaries is one of the principal concepts behind distributed tracing. To find these correlations, components in a distributed system need to be able to collect, store, and transfer metadata referred to as context.

Across threads

  • Use ThreadLocal to pass the TraceId / SpanId within a thread; when work is handed off to another thread or a thread pool, the context has to be captured and restored explicitly, as in the sketch below.
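A minimal sketch of this, assuming a home-grown TraceContext holder; production agents typically wrap executors or use transmittable thread-locals to achieve the same effect.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: a ThreadLocal carries the TraceId inside one thread; when work is
// submitted to a pool, the context is captured at submit time and restored
// inside the worker thread.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    public static void setTraceId(String traceId) { TRACE_ID.set(traceId); }
    public static String getTraceId() { return TRACE_ID.get(); }

    // Wrap a task so it runs with the submitter's trace context.
    public static Runnable wrap(Runnable task) {
        String captured = getTraceId();                 // capture on the business thread
        return () -> {
            String previous = getTraceId();
            setTraceId(captured);                       // restore on the worker thread
            try {
                task.run();
            } finally {
                setTraceId(previous);                   // don't leak context in pooled threads
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        setTraceId("0ad1348f1403169275002100356696");
        pool.submit(wrap(() -> System.out.println("worker sees traceId=" + getTraceId())));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```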

Across RESTful-style service APIs

  • There are several protocols for context propagation that OpenTelemetry recognizes (a hand-rolled injection sketch follows this list):

    • W3C Trace-Context HTTP Propagator

    • W3C Correlation-Context HTTP Propagator

    • B3 Zipkin HTTP Propagator
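For illustration, here is a hedged sketch of injecting the B3 headers from the earlier diagram into an outgoing request with the JDK HttpClient types; production services would normally rely on an instrumentation library (Brave, the OpenTelemetry SDK) rather than hand-rolling this.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: inject the current trace context into outgoing B3 headers (client
// side), then read it back the way a server-side extractor would.
public class B3PropagationExample {
    static HttpRequest inject(String url, String traceId, String spanId,
                              String parentSpanId, boolean sampled) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("X-B3-TraceId", traceId)
                .header("X-B3-ParentSpanId", parentSpanId)
                .header("X-B3-SpanId", spanId)
                .header("X-B3-Sampled", sampled ? "1" : "0")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = inject("http://example.com/foo",
                "0ad1348f1403169275002100356696", "6b", "aa", true);
        // A server-side tracer would extract the same headers from the incoming request.
        String traceId = request.headers().firstValue("X-B3-TraceId").orElse("unsampled");
        System.out.println("extracted traceId=" + traceId);
        // To actually send it: HttpClient.newHttpClient().send(request, BodyHandlers.ofString());
    }
}
```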

Across components such as message queues / cache / DB

  1. Add the context variables inside the message body.

    • Cons: tampers with the message payload.

  2. Change the message queue protocol so that the context can ride in message headers (a Kafka sketch follows this list).

    • Cons: challenging.
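As a sketch of option 2, Kafka (0.11+) already supports record headers, so the trace context can travel alongside the payload without touching the message body. The helper below is illustrative and assumes the kafka-clients dependency.

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: attach the trace context as Kafka record headers instead of
// embedding it in the message body.
public class TracedRecordHelper {
    public static ProducerRecord<String, String> withTraceContext(
            ProducerRecord<String, String> record, String traceId, String spanId) {
        record.headers()
              .add("X-B3-TraceId", traceId.getBytes(StandardCharsets.UTF_8))
              .add("X-B3-SpanId", spanId.getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```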

OpenTracing API standards

  • Reference: the OpenTracing specification

Architecture

Data collection

Asynchronous processing with bounded buffer queue

  • No matter which approach the data collector adopts, the threads that send out telemetry data must be separated from the business threads; hand that work to a background thread pool.

  • There should be a queue between the business threads and the background threads, and this queue should be bounded in size to avoid out-of-memory issues. A minimal sketch follows the diagram below.

┌─────────────────────────────────────────────────────────────────────────────────┐                                            
│                                   Application                                   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│   ┌───────────────────┐       ┌───────────────┐       ┌─────────────────────┐   │                                            
│   │                   │       │               │       │                     │   │                                            
│   │                   │       │               │       │                     │   │     ┌────────────┐      ┌─────────────────┐
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │  Business logic   │       │ Bounded size  │       │                     │   │     │            │      │Log/Trace/Metrics│
│   │      threads      │──────▶│queue to avoid │──────▶│ Background threads  │   │────▶│Kafka / UDP │─────▶│    Processor    │
│   │                   │       │ Out of Memory │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     │            │      │                 │
│   │                   │       │               │       │                     │   │     └────────────┘      └─────────────────┘
│   │                   │       │               │       │                     │   │                                            
│   └───────────────────┘       └───────────────┘       └─────────────────────┘   │                                            
│                                                                                 │                                            
│                                                                                 │                                            
│                                                                                 │                                            
└─────────────────────────────────────────────────────────────────────────────────┘
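A minimal sketch of the pattern in the diagram, assuming a home-grown Span record and a placeholder send() destination (Kafka, UDP, or a local agent):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch: business threads enqueue spans into a bounded queue; a background
// thread drains them in batches and ships them out.
public class AsyncSpanReporter {
    record Span(String traceId, String spanId, long durationMicros) {}

    private final BlockingQueue<Span> queue = new ArrayBlockingQueue<>(10_000);

    // Called on business threads: never blocks; drops the span if the queue is full.
    public boolean report(Span span) {
        return queue.offer(span);
    }

    // Runs on a background thread: drain in batches and ship to Kafka/UDP/etc.
    public void runReporterLoop() throws InterruptedException {
        List<Span> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            Span first = queue.poll(1, TimeUnit.SECONDS);   // wait for work
            if (first == null) continue;
            batch.add(first);
            queue.drainTo(batch, 99);                        // batch up to 100 spans
            send(batch);
            batch.clear();
        }
    }

    private void send(List<Span> batch) {
        // placeholder: write to Kafka, a UDP socket, or a local agent
        System.out.println("sending " + batch.size() + " spans");
    }
}
```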

Approaches

Manual tracing

  • Manually add tracing logs

AOP
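The AOP idea can be illustrated with a plain JDK dynamic proxy; this is a hedged sketch, and real agents use AspectJ or bytecode weaving rather than hand-written proxies.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Sketch: wrap any interface with a proxy that records a pseudo-span around
// each call, without touching the business implementation.
public class TracingProxy {
    interface OrderService {
        String createOrder(String item);
    }

    @SuppressWarnings("unchecked")
    static <T> T trace(Class<T> iface, T target) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println("span " + method.getName() + " took " + micros + "us");
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        OrderService real = item -> "order-001 for " + item;
        OrderService traced = trace(OrderService.class, real);
        System.out.println(traced.createOrder("book"));
    }
}
```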

Bytecode Instrumentation

Append to log files

  • The appender is responsible for outputting formatted logs to destinations such as disk files, the console, etc. Trace files can then be processed in a similar way to log files.

    • When multiple threads use the same appender, there is a chance of resource contention, so the append operation needs to be asynchronous; and to support asynchronous operation, there must be a buffer queue (see the bounded buffer queue discussion above).

Data storage

Requirement analysis

  • There is no fixed data model, but the calling chain has a tree structure.

  • Large amounts of data; it should preferably be compressed.

    • Sample volume figure: Meituan reports roughly 100 TB per day.

Column-family data storage

Data model for a normal trace

  • Use TraceID as rowKey

  • Has two columns

    • Basic info column: Basic info about trace

    • Calling info column: (Each remote service call has four phases)

      • P1: Client send

      • P2: Server receive

      • P3: Server send

      • P4: Client receive

  • Using HBase as an example for an ecommerce website

| TraceId (rowKey) | Basic Info: Type | Basic Info: Status | Calling Info: SpanId 1, P1 | Calling Info: SpanId 1, P2 | Calling Info: SpanId 1, P3 | Calling Info: SpanId 1, P4 | Calling Info: SpanId 2, P1 | Calling Info: SpanId 2, P2 | Calling Info: SpanId 2, P3 |
| ---------------- | ---------------- | ------------------ | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- |
| 0001 | buy | finished | P1 calling info | P2 calling info | P3 calling info | P4 calling info | P1 calling info | P2 calling info | P3 calling info |
| 0002 | refund | processing | P1 calling info | P2 calling info | P3 calling info | P4 calling info | P1 calling info | empty, to be filled when finished | ... |
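The four phases recorded per span are what latency breakdowns are derived from. The arithmetic below is the standard derivation, shown with illustrative timestamps (it is not spelled out in the table itself).

```java
// Standard latency breakdown derived from the four phase timestamps (P1..P4)
// recorded for each span in the calling-info columns above.
public class SpanLatency {
    public static void main(String[] args) {
        long p1 = 1403169275002L;   // P1: client send
        long p2 = 1403169275012L;   // P2: server receive
        long p3 = 1403169275095L;   // P3: server send
        long p4 = 1403169275110L;   // P4: client receive

        long clientDuration  = p4 - p1;                          // 108 ms end-to-end
        long serverDuration  = p3 - p2;                          // 83 ms spent in the server
        long networkOverhead = clientDuration - serverDuration;  // ~25 ms on the wire, both directions
        System.out.println(clientDuration + " " + serverDuration + " " + networkOverhead);
    }
}
```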

Data model for a business trace

  • Motivation:

    • The trace data model above covers the case where all spans can be concatenated with a single trace ID. There are also cases where multiple trace IDs need to be concatenated to form a business chain.

    • For example, in an e-commerce system a customer could create an order, then revise the existing order, and later cancel it; each action produces its own trace, yet they belong to one business flow.

  • This also needs column-family storage mapping TraceId -> a JSON blob, plus the reverse mapping from each system's transaction ID -> TraceId.

| TraceID | Order system transaction ID | Payment system transaction ID | User system transaction ID |
| ------- | --------------------------- | ----------------------------- | -------------------------- |
| 0001 | 1 | 2 | 3 |
| 0002 | 4 | 5 | 6 |
| 0003 | 7 | 8 | 9 |

Distributed file system

  • Each block needs a corresponding 48-bit index entry. Based on the TraceId, the index position can be determined.

  • The TraceId format can be defined in a way that makes locating the index and block data easier. For example, the TraceId ShopWeb-0a010680-375030-2 has four segments. The index file name can be derived from "ShopWeb" + "0a010680" + "375030", and the block position can be inferred from the 4th segment (a parsing sketch follows this list).

    • ShopWeb: application name

    • 0a010680: current machine's IP address (hex-encoded)

    • 375030: current time, in hours

    • 2: monotonically increasing sequence number within the current unit
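A small parsing sketch of this lookup, assuming an illustrative index-file naming scheme and 6-byte (48-bit) index entries; the exact layout is not specified above.

```java
// Sketch: derive the index file and the entry offset from a TraceId of the
// form "App-IPHex-HourBucket-Seq" described above. Naming and layout are assumptions.
public class TraceIndexLocator {
    public static void main(String[] args) {
        String traceId = "ShopWeb-0a010680-375030-2";
        String[] parts = traceId.split("-");

        String app        = parts[0];                 // application name
        String ipHex      = parts[1];                 // machine IP, hex encoded
        String hourBucket = parts[2];                 // time bucket (hours)
        long seq          = Long.parseLong(parts[3]); // sequence within the bucket

        String indexFile  = app + "-" + ipHex + "-" + hourBucket + ".idx"; // assumed naming
        long entryOffset  = seq * 6;                  // 48-bit entries => 6 bytes each
        System.out.println(indexFile + " @ byte " + entryOffset);
    }
}
```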

Distributed tracing solutions

OpenTracing

Solution inventory

  • 2014 Google Dapper

  • Alibaba EagleEye

  • Jingdong Hydra

  • Pinpoint (APM)

OpenZipkin

// Here’s an example sequence of http tracing where user code calls the resource /foo. This results in a single span, sent asynchronously to Zipkin after user code receives the http response.

// Trace instrumentation reports spans asynchronously to prevent delays or failures relating to the tracing system from delaying or breaking user code.

┌─────────────┐ ┌───────────────────────┐  ┌─────────────┐  ┌──────────────────┐
│ User Code   │ │ Trace Instrumentation │  │ Http Client │  │ Zipkin Collector │
└─────────────┘ └───────────────────────┘  └─────────────┘  └──────────────────┘
       │                 │                         │                 │
           ┌─────────┐
       │ ──┤GET /foo ├─▶ │ ────┐                   │                 │
           └─────────┘         │ record tags
       │                 │ ◀───┘                   │                 │
                           ────┐
       │                 │     │ add trace headers │                 │
                           ◀───┘
       │                 │ ────┐                   │                 │
                               │ record timestamp
       │                 │ ◀───┘                   │                 │
                             ┌─────────────────┐
       │                 │ ──┤GET /foo         ├─▶ │                 │
                             │X-B3-TraceId: aa │     ────┐
       │                 │   │X-B3-SpanId: 6b  │   │     │           │
                             └─────────────────┘         │ invoke
       │                 │                         │     │ request   │
                                                         │
       │                 │                         │     │           │
                                 ┌────────┐          ◀───┘
       │                 │ ◀─────┤200 OK  ├─────── │                 │
                           ────┐ └────────┘
       │                 │     │ record duration   │                 │
            ┌────────┐     ◀───┘
       │ ◀──┤200 OK  ├── │                         │                 │
            └────────┘       ┌────────────────────────────────┐
       │                 │ ──┤ asynchronously report span     ├────▶ │
                             │                                │
                             │{                               │
                             │  "traceId": "aa",              │
                             │  "id": "6b",                   │
                             │  "name": "get",                │
                             │  "timestamp": 1483945573944000,│
                             │  "duration": 386000,           │
                             │  "annotations": [              │
                             │--snip--                        │
                             └────────────────────────────────┘

Pinpoint

Compare Pinpoint and OpenZipkin

  • Language support:

    • OpenZipkin has broad language support, including C#, Go, Java, JavaScript, Ruby, Scala, and PHP.

    • Pinpoint only supports Java.

  • Integration effort:

    • OpenZipkin's Brave instrumentation API needs to be embedded inside the business logic.

    • Pinpoint uses bytecode instrumentation and requires no code modifications.

  • Trace granularity:

    • OpenZipkin: code level.

    • Pinpoint: bytecode level (more granular).

Meituan

Ali

  • Alibaba EagleEye

References

  • B3 propagation: https://github.com/openzipkin/b3-propagation

  • On demand sampling: https://github.com/openzipkin-contrib/zipkin-secondary-sampling/blob/master/docs/design.md

  • SOFATracer TraceId generation rule: https://www.sofastack.tech/en/projects/sofa-tracer/traceid-generated-rule/

  • Additional reference: https://www.programmersought.com/article/65184544752/

  • Datadog and Opentracing: https://www.datadoghq.com/blog/opentracing-datadog-cncf/

  • Twitter Zipkin: https://zipkin.io/pages/architecture.html

  • Pinpoint: https://pinpoint-apm.github.io/pinpoint/

  • Pinpoint bytecode instrumentation: https://pinpoint-apm.github.io/pinpoint/techdetail.html#bytecode-instrumentation-not-requiring-code-modifications

  • DaZhongDianPing CAT (Chinese): https://github.com/dianping/cat

  • Apache SkyWalking: https://github.com/apache/skywalking

  • Tech blog on bytecode instrumentation (Chinese): https://tech.meituan.com/2019/09/05/java-bytecode-enhancement.html

  • Meituan's in-depth analysis of the open-source framework CAT (Chinese): https://tech.meituan.com/2018/11/01/cat-in-depth-java-application-monitoring.html

  • Meituan distributed tracing MTrace (Chinese): https://zhuanlan.zhihu.com/p/23038157

  • Alibaba Cloud distributed tracing documentation (Chinese): https://help.aliyun.com/document_detail/133635.html

  • Java instrumentation API (Chinese): https://tech.meituan.com/2019/02/28/java-dynamic-trace.html

  • Mobile-side monitoring (Chinese): https://time.geekbang.org/dailylesson/topic/135

  • Instant messaging system end-to-end (Chinese): https://time.geekbang.org/column/article/146995?utm_source=related_read&utm_medium=article&utm_term=related_read
