Master-slave KV

Distributed "sharding proxy"

Design thoughts

  1. Master-slave model

    • The master holds the hashmap of [Key, server address].

    • Slaves are responsible for storing the data.

    • Read process

      1. The client sends a read request for key K to the master server.

      2. The master looks up its consistent hashmap and returns the index of the slave server.

      3. The client sends the read request for key K to that slave server.

        1. The slave first checks the key inside memory (the memtable).

        2. It then checks the bloom filter of each file to decide which file might contain the key.

        3. It uses that file's index to locate the value for the key.

        4. It reads and returns the key-value pair.

    • Write process

      1. The client sends a write request for the pair (K, V) to the master server.

      2. The master returns the slave server index.

      3. The client sends the write request for the pair (K, V) to that slave server.

        1. The slave records the write operation in its write-ahead log.

        2. The write then goes directly to the in-memory skip list.

        3. When the in-memory skip list reaches its maximum capacity, sort it and write it to disk as an SSTable, creating an index and a bloom filter for it at the same time.

        4. Then create a new memtable/file.

  2. How to handle race conditions

    • The master server also relies on a distributed lock (such as Chubby/ZooKeeper).

    • Distributed lock

      • The consistent hashmap is stored inside the lock server.

  3. (Optional) Too much data to store on the slave's local disk

    • Replace the local disk with a distributed file system (e.g. GFS) for:

      • Disk size

      • Replicas

      • Failure and recovery

    • The write-ahead log and SSTables are all stored inside GFS.

      • How to write an SSTable to GFS

        • Divide the SSTable into multiple chunks (64 MB each) and store each chunk inside GFS.

  4. The config server can easily become a single point of failure

    • Clients can cache the routing table (see the sketch after this list).
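
A minimal sketch of how a client could cache routing entries locally (point 4 above), so that repeated lookups avoid a round trip to the config server. `MasterClient`, `CachedRoutingTable`, and the per-key cache shape are illustrative assumptions, not part of any specific system:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stub for the master/config server lookup.
interface MasterClient {
    String lookupServer(String key);        // ask the master which data server owns this key
}

class CachedRoutingTable {
    private final Map<String, String> cache = new ConcurrentHashMap<>();  // key -> data server address
    private final MasterClient master;

    CachedRoutingTable(MasterClient master) {
        this.master = master;
    }

    String serverFor(String key) {
        // Cache hit: no network call to the config server (the dashed lines in the flow chart below).
        return cache.computeIfAbsent(key, master::lookupServer);
    }

    void invalidate(String key) {
        // Called when a data server rejects a request because the key has moved.
        cache.remove(key);
    }
}
```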

Flow chart

  • The dashed lines mean that these network calls can be avoided when the routing table is cached on the client.

                                         ┌─────────────────────────────────┐      
                                         │          Config server          │      
 ┌────────────────────┐ ─ ─ ─step5─ ─ ─▶ │   (where routing table stays)   │      
 │       Client       │                  │                                 │      
 │  ┌──────────────┐  │                  │ ┌───────────┐     ┌───────────┐ │      
 │  │cache of      │  │                  │ │           │     │           │ │      
 │  │routing table │  │  ─ ─Step1─ ─ ─ ▶ │ │  Master   │     │   Slave   │ │      
 │  └──────────────┘  │                  │ │           │     │           │ │      
 └────────────────────┘ ◀─ ─ ─ Step3 ─ ─ │ └───────────┘     └───────────┘ │      
            │                            └─────────────────────────────────┘      
            │                                           │      │                  
            │                                                                     
            │                                          Step2  step 6              
            │                                                                     
            │                                           ▼      ▼                  
            └─────────Step4─────────────────────┐  ┌──────────────┐               
                                                │  │ Distributed  │               
                                                │  │     lock     │       ─       
                                                │  └──────────────┘               
                                                │                                 
                                                │                                 
                                                ▼                                 
┌───────────────┐     ┌───────────────┐    ┌───────────────┐     ┌───────────────┐
│ Data server 1 │     │ Data server 2 │    │               │     │ Data server N │
│               │     │               │    │               │     │               │
│┌────────────┐ │     │┌────────────┐ │    │    ......     │     │┌────────────┐ │
││in-memory   │ │     ││in-memory   │ │    │               │     ││in-memory   │ │
││sorted list │ │     ││sorted list │ │    │               │     ││sorted list │ │
│└────────────┘ │     │└────────────┘ │    └───────────────┘     │└────────────┘ │
│┌────────────┐ │     │┌────────────┐ │                          │┌────────────┐ │
││in-disk     │ │     ││in-disk     │ │                          ││in-disk     │ │
││sorted list │ │     ││sorted list │ │                          ││sorted list │ │
││1 and bloom │ │     ││1 and bloom │ │                          ││1 and bloom │ │
││filter/index│ │     ││filter/index│ │                          ││filter/index│ │
│└────────────┘ │     │└────────────┘ │                          │└────────────┘ │
│┌────────────┐ │     │┌────────────┐ │                          │┌────────────┐ │
││......      │ │     ││......      │ │                          ││......      │ │
│└────────────┘ │     │└────────────┘ │                          │└────────────┘ │
│┌────────────┐ │     │┌────────────┐ │                          │┌────────────┐ │
││in-disk     │ │     ││in-disk     │ │                          ││in-disk     │ │
││sorted list │ │     ││sorted list │ │                          ││sorted list │ │
││N and bloom │ │     ││N and bloom │ │                          ││N and bloom │ │
││filter/index│ │     ││filter/index│ │                          ││filter/index│ │
│└────────────┘ │     │└────────────┘ │                          │└────────────┘ │
└───────────────┘     └───────────────┘                          └───────────────┘

Read process

  1. Step 1: The client sends a read request for key K to the master server.

  2. Step 2/3: The master server locks the key, then returns the slave server index by checking its consistent hashmap.

  3. Step 4: The client sends the read request for key K to the slave server (sketched after this list).

    1. The slave first checks the key inside memory (the memtable).

    2. It then checks the bloom filter of each file to decide which file might contain the key.

    3. It uses that file's index to locate the value for the key.

    4. It reads and returns the key-value pair.

    5. The read process finishes, and the slave notifies the client.

  4. Step 5: The client notifies the master server to unlock the key.

  5. Step 6: The master unlocks the key.
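
A minimal sketch of the slave-side read path in step 4 above: memtable first, then the per-file bloom filter, then the file's index. `SlaveReadPath` and `SSTableStub` are simplified stand-ins for a real storage engine:

```java
import java.util.*;

class SlaveReadPath {
    private final NavigableMap<String, String> memTable = new TreeMap<>();
    private final List<SSTableStub> ssTables = new ArrayList<>();   // newest file first

    String get(String key) {
        String value = memTable.get(key);           // 1. first check the key inside memory
        if (value != null) return value;
        for (SSTableStub table : ssTables) {        // 2. bloom filter per file: skip files that
            if (!table.mightContain(key)) continue; //    definitely do not contain the key
            value = table.lookup(key);              // 3. use the file's index to locate the value
            if (value != null) return value;        // 4. read and return the key-value pair
        }
        return null;                                // key not found on this slave
    }
}

class SSTableStub {
    // Stand-ins for a real bloom filter and block index; enough to show the flow.
    private final Set<String> bloomFilter = new HashSet<>();
    private final Map<String, String> indexedData = new HashMap<>();

    boolean mightContain(String key) { return bloomFilter.contains(key); }
    String lookup(String key) { return indexedData.get(key); }
}
```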

Write process

  1. Step 1: The client sends a write request for the pair (K, V) to the master server.

  2. Step 2/3: The master server locks the key, then returns the slave server index.

  3. Step 4: The client sends the write request for the pair (K, V) to the slave server (sketched after this list).

    1. The slave records the write operation in its write-ahead log.

    2. The write then goes directly to the in-memory skip list.

    3. When the in-memory skip list reaches its maximum capacity, sort it and write it to disk as an SSTable, creating an index and a bloom filter for it at the same time.

    4. Then create a new memtable/file.

    5. The write process finishes, and the slave notifies the client.

  4. Step 5: The client notifies the master server to unlock the key.

  5. Step 6: The master unlocks the key.
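
A minimal sketch of the slave-side write path in step 4 above: append to the write-ahead log, insert into the in-memory skip list, and flush to an SSTable-like file once the memtable is full. The file formats, flush threshold, and class names are illustrative assumptions, not a real engine:

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

class SlaveWritePath {
    private static final int MEMTABLE_LIMIT = 4;    // tiny threshold, for illustration only
    private final ConcurrentSkipListMap<String, String> memTable = new ConcurrentSkipListMap<>();
    private final File dir;
    private final PrintWriter wal;
    private int ssTableSeq = 0;

    SlaveWritePath(File dir) throws IOException {
        this.dir = dir;
        // Every write is recorded in the write-ahead log before touching the memtable.
        this.wal = new PrintWriter(new FileWriter(new File(dir, "wal.log"), true), true);
    }

    synchronized void put(String key, String value) throws IOException {
        wal.println(key + "\t" + value);             // 1. record the write in the WAL
        memTable.put(key, value);                    // 2. write goes to the in-memory skip list
        if (memTable.size() >= MEMTABLE_LIMIT) {     // 3. memtable full: flush it as an SSTable
            flushToSsTable();
            memTable.clear();                        // 4. then start a new (empty) memtable
        }
    }

    private void flushToSsTable() throws IOException {
        // The skip list is already sorted, so it can be written out sequentially.
        // A real engine would also build a bloom filter and a sparse index here.
        File file = new File(dir, "sstable-" + (ssTableSeq++) + ".data");
        try (PrintWriter out = new PrintWriter(new FileWriter(file))) {
            for (Map.Entry<String, String> e : memTable.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```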

Consistency model

Gossip

Raft based

Overview

Read process

  • There are multiple options for read consistency:

    • Default: may sometimes return stale data

    • Consistent: never returns stale data

    • Stale: may return stale data
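
The three modes above resemble Consul-style read consistency levels. A hedged sketch of how a server might branch on the requested mode; the enum, `RaftNodeStub`, and method names are hypothetical, not any product's real API:

```java
enum ReadConsistency { DEFAULT, CONSISTENT, STALE }

// Hypothetical stub for the underlying Raft node.
interface RaftNodeStub {
    void confirmLeadershipWithQuorum();      // round-trip to a quorum of peers
    String readFromStateMachine(String key);
}

class RaftKvRead {
    String read(String key, ReadConsistency mode, RaftNodeStub node) {
        switch (mode) {
            case CONSISTENT:
                // Confirm leadership with a quorum first, so a deposed leader never serves the read.
                node.confirmLeadershipWithQuorum();
                return node.readFromStateMachine(key);
            case STALE:
                // Any node (leader or follower) may answer, so the value can be stale.
                return node.readFromStateMachine(key);
            case DEFAULT:
            default:
                // The leader answers without a quorum round-trip: usually fresh,
                // but can be stale across a leadership change.
                return node.readFromStateMachine(key);
        }
    }
}
```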

Write process

  1. The client sends an HTTP request to the server.

  2. After the server receives the request, it puts the message into the KVStore's ProposeC channel.

  3. The RaftNode submits the message to the Raft module's propose interface.

  4. The Raft module outputs a Ready structure. After the server persists the log entries in it, it sends them to the other nodes.

  5. After a majority of nodes have persisted the log entry, the entry is submitted to the KVStore for execution.

  6. The KVStore calls the backend storage engine.
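
Step 5 relies on the Raft commit rule: an entry counts as committed once a majority of nodes (leader included) have persisted it. A small sketch of how a leader could derive the commit index from its followers' match indexes; the names are illustrative, and a real implementation also checks that the entry belongs to the leader's current term:

```java
import java.util.Arrays;

class CommitIndexCalculator {
    // The commit index is the highest log index that a majority of the n nodes have persisted.
    static long commitIndex(long leaderLastIndex, long[] followerMatchIndexes) {
        long[] all = Arrays.copyOf(followerMatchIndexes, followerMatchIndexes.length + 1);
        all[all.length - 1] = leaderLastIndex;       // the leader has persisted its own entries
        Arrays.sort(all);
        // At this position, at least ceil(n/2) of the n nodes have persisted up to the index.
        return all[(all.length - 1) / 2];
    }

    public static void main(String[] args) {
        // Leader persisted up to index 7; followers persisted up to 7, 5, 4 and 3.
        System.out.println(commitIndex(7, new long[]{7, 5, 4, 3}));   // prints 5
    }
}
```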

Data distribution model

Consistent hashing

  • Ref: Riak https://www.infoq.com/articles/dynamo-riak-random-slicing/
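
A minimal consistent-hashing ring with virtual nodes, as one way the routing table could map keys to data servers. The hash function (CRC32) and the virtual-node count are arbitrary choices for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.CRC32;

class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(Collection<String> servers, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        servers.forEach(this::addServer);
    }

    void addServer(String server) {
        // Each physical server is placed on the ring many times to smooth the distribution.
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(server + "#" + i), server);
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(server + "#" + i));
    }

    String serverFor(String key) {
        // Walk clockwise to the first virtual node at or after the key's hash; wrap around if needed.
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing(List.of("data-server-1", "data-server-2", "data-server-3"), 100);
        System.out.println(r.serverFor("user:42"));   // prints whichever data server owns this key
    }
}
```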

Access pattern

Read intensive: boltdb

  • Based on B+ tree

  • Performance benchmark: on a 3-node cluster (8 cores and 16 GB per node), linearizable read throughput is roughly 190K QPS and write throughput is roughly 50K QPS.

Write intensive: leveldb

  • Based on LSM tree

Concurrency

Performant bootup

Reference

TairDB

  • Technical analysis of Tair, Taobao's open-source key/value storage system: https://www.infoq.cn/article/taobao-tair/

LevelDB

  • LevelDB:

  • Disk IO

  • Geektime (极客时间):

Bitcask

  • https://medium.com/@arpitbhayani/bitcask-a-log-structured-fast-kv-store-c6c728a9536b

  • Storage model (merge and hint files): https://topic.alibabacloud.com/a/implementation-of-the-bitcask-storage-model-merge-and-hint-files_8_8_31516931.html

  • Implement a custom key-value storage system: https://medium.com/@felipedutratine/implement-a-custom-key-value-storage-system-3df4c1eb35e9

TODO

  • Using LevelDB and RocksDB as an example

  • Meituan's KV store built on top of Tair and Redis

  • Hot key problem

  • Tair distributed key-value storage solution (Tair 分布式K-V存储方案)

Distributed "sharding proxy"
Design thoughts
Flow chart
Read process
Write process
Consistency model
Gossip
Raft based
Overview
Read process
Write process
Data distribution model
Consistent hashing
Access pattern
Read intensive: boltdb
Write intensive: leveldb
Concurrency
Performant bootup
Reference
TairDB
LevelDB
Bitcask
TODO
  • https://soulmachine.gitbooks.io/system-design/content/cn/key-value-store.html
  • https://tech.meituan.com/2020/07/01/kv-squirrel-cellar.html
  • https://zhuanlan.zhihu.com/p/32743904
  • https://www.cnblogs.com/chenny7/p/4875396.html
  • https://leveldb-handbook.readthedocs.io/zh/latest/basic.html
  • https://zhuanlan.zhihu.com/p/51360281
  • https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017
  • https://time.geekbang.org/column/article/347136
  • https://time.geekbang.org/column/article/217049
  • Build a key-value store on top of MySQL, with Uber as an example
  • How does Riak use version vectors to resolve conflicts
  • How Cassandra handles conflicts
  • Harvard presentation: Scaling write-intensive key-value stores