Master-slave KV
Distributed "sharding proxy"
Design thoughts
Master-slave model
Master holds the hashmap [key, slave server address]
Slaves are responsible for storing the data
Read process
Client sends a read request for key K to the master server.
Master checks its consistent hashmap and returns the index/address of the slave server that owns the key.
Client sends the read request for key K to that slave server, which then (see the sketch after these steps):
First checks for the key in memory (the memtable).
Checks the bloom filter of each file to decide which files might contain the key.
Uses the file's index to find the value for the key.
Reads and returns the key-value pair.
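A minimal sketch of this slave-side read path, assuming a plain map for the memtable and simplified in-memory stand-ins for the bloom filter and index (all names are illustrative):

```go
package store

// Sketch of the slave-side read path; the memtable is a plain map here
// instead of a skip list, and the bloom filter / index are simplified
// in-memory structures.

type sstable struct {
	filter map[string]bool   // stands in for a real bloom filter
	index  map[string]string // stands in for a sparse index + data file
}

func (t *sstable) mayContain(key string) bool { return t.filter[key] }

type slave struct {
	memtable map[string]string // in-memory skip list in the real design
	sstables []*sstable        // newest first
}

func (s *slave) get(key string) (string, bool) {
	// 1. First check for the key in memory.
	if v, ok := s.memtable[key]; ok {
		return v, true
	}
	// 2. Check each file's bloom filter to decide which files might have the key.
	for _, t := range s.sstables {
		if !t.mayContain(key) {
			continue // definitely not in this file
		}
		// 3. Use the file's index to find the value for the key.
		if v, ok := t.index[key]; ok {
			return v, true
		}
	}
	return "", false
}
```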
Write process
Client sends a write request for pair (K, V) to the master server.
Master returns the index/address of the slave server that owns the key.
Client sends the write request for pair (K, V) to that slave server, which then (see the sketch after these steps):
Records the write operation in its write-ahead log (WAL).
Applies the write to the in-memory skip list (memtable).
When the in-memory skip list reaches its maximum capacity, sorts it and flushes it to disk as an SSTable, building an index and a bloom filter for it at the same time.
Then starts a new memtable/file.
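A minimal sketch of this write path, assuming the WAL is a plain append-only file and a sorted map stands in for the skip list; the flush format and names are illustrative:

```go
package store

import (
	"fmt"
	"os"
	"sort"
)

type writer struct {
	wal      *os.File          // append-only write-ahead log
	memtable map[string]string // skip list in the real design
	limit    int               // max entries before flushing to an SSTable
}

func (w *writer) put(key, value string) error {
	// 1. Record the write in the write-ahead log first.
	if _, err := fmt.Fprintf(w.wal, "%s\t%s\n", key, value); err != nil {
		return err
	}
	// 2. Apply the write to the in-memory table.
	w.memtable[key] = value
	// 3. When the memtable is full, sort it and flush it to disk as an SSTable,
	//    then start a new memtable.
	if len(w.memtable) >= w.limit {
		return w.flush()
	}
	return nil
}

func (w *writer) flush() error {
	keys := make([]string, 0, len(w.memtable))
	for k := range w.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys) // SSTables store keys in sorted order
	f, err := os.CreateTemp("", "sstable-*.sst")
	if err != nil {
		return err
	}
	defer f.Close()
	for _, k := range keys {
		fmt.Fprintf(f, "%s\t%s\n", k, w.memtable[k])
		// a real implementation would also add k to the bloom filter
		// and record its file offset in the index here
	}
	w.memtable = map[string]string{} // start a new memtable
	return nil
}
```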
How to handle race conditions
Master server also provides a distributed lock (such as Chubby/ZooKeeper).
Distributed lock
The consistent hashmap is stored inside the lock server (a minimal interface sketch follows).
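A hedged sketch of what the combined lock/routing service could expose to clients; the interface and method names are assumptions, not an actual Chubby or ZooKeeper API:

```go
package store

// LockService is an illustrative view of the master/lock server.
type LockService interface {
	// Lookup consults the consistent hashmap stored in the lock server
	// and returns the address of the slave that owns the key.
	Lookup(key string) (addr string, err error)
	// Lock and Unlock guard a single key so that concurrent reads and
	// writes of the same key do not race.
	Lock(key string) error
	Unlock(key string) error
}
```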
(Optional) Too much data to store on a slave's local disk
Replace the local disk with a distributed file system (e.g. GFS) for:
Disk capacity
Replication
Failure and recovery
Both the write-ahead log and the SSTables are then stored inside GFS.
How to write an SSTable to GFS
Divide the SSTable into multiple chunks (64 MB each) and store each chunk inside GFS (see the sketch below).
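A small sketch of the chunking step, assuming the DFS client is hidden behind a hypothetical gfsWrite callback:

```go
package store

const chunkSize = 64 << 20 // 64 MB, the GFS chunk size

// writeSSTable splits the serialized SSTable into fixed-size chunks and
// hands each chunk to the distributed file system.
func writeSSTable(data []byte, gfsWrite func(chunkID int, chunk []byte) error) error {
	for i, id := 0, 0; i < len(data); i, id = i+chunkSize, id+1 {
		end := i + chunkSize
		if end > len(data) {
			end = len(data)
		}
		if err := gfsWrite(id, data[i:end]); err != nil {
			return err
		}
	}
	return nil
}
```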
The config (master) server can easily become a single point of failure.
Clients can cache the routing table (see the sketch below).
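A sketch of a client-side routing cache with fallback to the config server; the field and function names are illustrative:

```go
package store

type routingClient struct {
	cache      map[string]string                // key -> slave address
	fromMaster func(key string) (string, error) // RPC to the config/master server
}

// route returns the slave address for a key, avoiding the network call
// to the master when the routing entry is already cached.
func (c *routingClient) route(key string) (string, error) {
	if addr, ok := c.cache[key]; ok {
		return addr, nil // master round trip avoided
	}
	addr, err := c.fromMaster(key)
	if err != nil {
		return "", err
	}
	c.cache[key] = addr
	return addr, nil
}
```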
Flow chart
The dashed lines mean these network calls can be avoided if the routing table is cached on the client.
Read process
Step 1: Client sends a read request for key K to the master server.
Steps 2/3: Master server locks the key and returns the responsible slave server by checking its consistent hashmap.
Step 4: Client sends the read request for key K to the slave server, which:
First checks for the key in memory (the memtable).
Checks the bloom filter of each file to decide which files might contain the key.
Uses the file's index to find the value for the key.
Reads and returns the key-value pair.
Once the read finishes, the slave notifies the client.
Step 5: The client notifies the master server to unlock the key.
Step 6: Master unlocks the key.
Write process
Step 1: Client sends a write request for pair (K, V) to the master server.
Steps 2/3: Master server locks the key and returns the responsible slave server.
Step 4: Client sends the write request for pair (K, V) to the slave server, which:
Records the write operation in its write-ahead log (WAL).
Applies the write to the in-memory skip list (memtable).
When the in-memory skip list reaches its maximum capacity, sorts it and flushes it to disk as an SSTable, building an index and a bloom filter for it at the same time.
Then starts a new memtable/file.
Once the write finishes, the slave notifies the client.
Step 5: The client notifies the master server to unlock the key.
Step 6: Master unlocks the key (an end-to-end sketch of this locked flow follows).
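An end-to-end sketch of the locked read flow above, reusing the LockService interface sketched earlier; slaveGet is a hypothetical RPC to the slave (the locked write flow is symmetric):

```go
package store

// lockedRead walks the steps from the flow chart: lock, route, read, unlock.
func lockedRead(ls LockService, slaveGet func(addr, key string) (string, error), key string) (string, error) {
	// Steps 1-3: master locks the key and returns the owning slave.
	if err := ls.Lock(key); err != nil {
		return "", err
	}
	defer ls.Unlock(key) // Steps 5-6: unlock once the slave has answered
	addr, err := ls.Lookup(key)
	if err != nil {
		return "", err
	}
	// Step 4: the slave serves the read from its memtable, bloom filters
	// and SSTables as described above.
	return slaveGet(addr, key)
}
```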
Consistency model
Gossip
Raft-based
Overview
Read process
There are multiple options for read consistency (see the sketch after this list):
Default: may occasionally read stale data
Consistent: never reads stale data
Stale: may read stale data
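A sketch of how a read handler might branch on these modes, assuming a Raft-style leader and a quorum confirmLeadership round trip for consistent reads; the semantics beyond the list above are my interpretation, not the source's:

```go
package store

import "errors"

type ReadMode int

const (
	Default    ReadMode = iota // leader answers from local state
	Consistent                 // leader confirms leadership with the quorum first
	Stale                      // any replica answers from local state
)

var errNotLeader = errors.New("not the leader")

func handleRead(mode ReadMode, isLeader bool, confirmLeadership func() error,
	local func(key string) string, key string) (string, error) {
	switch mode {
	case Consistent:
		// Pay one extra quorum round trip so the answer can never be stale.
		if err := confirmLeadership(); err != nil {
			return "", err
		}
	case Default:
		// Leader answers directly; stale only in rare leadership-change windows.
		if !isLeader {
			return "", errNotLeader
		}
	case Stale:
		// Any node answers from local state; cheapest, may lag the leader.
	}
	return local(key), nil
}
```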
Write process
Client sends an HTTP request to the server.
After the server receives the request, it puts the message into the KVStore's ProposeC channel.
The RaftNode submits the message to the Raft module's propose interface.
The Raft module outputs a Ready structure; after the server persists the log entry, it sends the entry to the other nodes.
After a majority of nodes have persisted the log entry, it is submitted back to the KVStore for execution.
The KVStore applies the entry through the backend storage engine (simplified sketch below).
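A simplified sketch of this propose/commit pipeline, modeled loosely on etcd's raftexample; the Raft module, WAL persistence and networking are reduced to two channels here, so the names are illustrative rather than the actual library API:

```go
package store

type kvstore struct {
	proposeC chan<- string     // proposals handed to the raft node (ProposeC)
	commitC  <-chan string     // entries come back after a majority persisted them
	data     map[string]string // backend storage engine stands in as a map
}

// Propose is what the HTTP handler calls when a client writes a key.
func (s *kvstore) Propose(kv string) {
	s.proposeC <- kv // step 2: put the message into the ProposeC channel
}

// serveCommits applies entries only after raft has committed them,
// i.e. a majority of nodes have persisted the log entry.
func (s *kvstore) serveCommits() {
	for kv := range s.commitC {
		key, value := split(kv) // steps 5-6: execute against the storage engine
		s.data[key] = value
	}
}

func split(kv string) (string, string) {
	for i := 0; i < len(kv); i++ {
		if kv[i] == '=' {
			return kv[:i], kv[i+1:]
		}
	}
	return kv, ""
}
```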
Data distribution model
Consistent hashing (a minimal ring sketch follows)
Ref (Riak): https://www.infoq.com/articles/dynamo-riak-random-slicing/
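A minimal consistent hashing ring without virtual nodes; real deployments add many virtual nodes per server to even out the key distribution:

```go
package store

import (
	"hash/crc32"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash positions of the servers
	owner  map[uint32]string // hash position -> server address
}

func newRing(servers []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, s := range servers {
		h := crc32.ChecksumIEEE([]byte(s))
		r.points = append(r.points, h)
		r.owner[h] = s
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the first server clockwise from the key's hash position.
func (r *ring) lookup(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}
```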
Access pattern
Read intensive: boltdb (usage example after this list)
Based on a B+ tree
Performance benchmark: a 3-node cluster (8 cores, 16 GB each) gives roughly 190K QPS of linearizable reads and 50K QPS of writes.
Write intensive: leveldb
Based on an LSM tree
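A small usage example for the read-intensive option, using bbolt (the maintained fork of boltdb, fetched with `go get go.etcd.io/bbolt`); read-only transactions can run concurrently, which suits read-heavy workloads:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("kv.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One read-write transaction to seed a value.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("kv"))
		if err != nil {
			return err
		}
		return b.Put([]byte("user:1"), []byte("alice"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read-only transaction; B+ tree point lookups are cheap.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("kv")).Get([]byte("user:1"))
		fmt.Printf("user:1 = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```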
Concurrency
Performant bootup
Reference
Using LevelDB and RocksDB as an example - https://soulmachine.gitbooks.io/system-design/content/cn/key-value-store.html
Meituan's KV store built on top of Tair and Redis - https://tech.meituan.com/2020/07/01/kv-squirrel-cellar.html
TairDB
Technical analysis of Tair, Taobao's open-source key/value storage system - https://www.infoq.cn/article/taobao-tair/
Hot key problem - https://zhuanlan.zhihu.com/p/32743904
Tair distributed K-V storage solution - https://www.cnblogs.com/chenny7/p/4875396.html
LevelDB
Bitcask
https://medium.com/@arpitbhayani/bitcask-a-log-structured-fast-kv-store-c6c728a9536b
Storage model (merge and hint files) - https://topic.alibabacloud.com/a/implementation-of-the-bitcask-storage-model-merge-and-hint-files_8_8_31516931.html
Implement a custom key-value storage system: https://medium.com/@felipedutratine/implement-a-custom-key-value-storage-system-3df4c1eb35e9
TODO