BigTable / HBase
History
Pain points of MySQL
Scaling out typically requires doubling the number of machines; otherwise a lot of data has to be moved across partitions to keep the data evenly distributed.
The sharding strategy is not transparent to developers.
If sharding by time, it is hard to decide whether to shard by year, month, or day: this year's volume might be 10X that of the previous year, and sharding by month or day runs into promotion seasons such as Black Friday.
Within a MySQL cluster each server can have a backup, but failing over to it typically requires manual intervention when a hard drive fails.
Motivation
Elasticity: add or remove servers at will
Dynamic load shedding: adjust the load on a single node
Fault tolerance: a minority of machines going down does not impact availability
Data model
Data in the same column family is stored together.
A single cell can store multiple versions of a value, distinguished by timestamp.
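A minimal sketch of the logical data model as nested maps, using a hypothetical order row; the table, family, and column names are illustrative, and in the real system the map is sparse, sorted by row key, and persisted on GFS.

```python
# Logical view: (row key, column family, column qualifier, timestamp) -> value
# All names and values below are hypothetical examples.
table = {
    "A20210101RST": {                 # row key
        "order": {                    # column family
            "status": {               # column qualifier
                1620000300: "PAID",   # timestamp -> value (newer version)
                1620000000: "CREATED",
            },
            "amount": {
                1620000000: "99.50",
            },
        },
    },
}

def latest(row, family, qualifier):
    """Return the newest version of a cell (largest timestamp)."""
    versions = table[row][family][qualifier]
    ts = max(versions)
    return ts, versions[ts]

print(latest("A20210101RST", "order", "status"))  # (1620000300, 'PAID')
```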
API
Components
BigTable dynamically allocates data to different partitions (tablets).
BigTable uses the master plus Chubby to manage the partition information.
Tablet server
Provides online data read/write service
Note: the tablet server is not responsible for storing the actual data; the data itself lives in GFS.
Master
Assigns tablets to tablet servers
Detects the addition and expiration of tablet servers
Balances the load across tablet servers
Garbage-collects files stored in GFS
Chubby
Guarantees that there is only one active master
Stores the bootstrap location of BigTable data (the root tablet location)
Discovers tablet servers and cleans up after their termination
Stores access control lists
SPOF in master without Chubby
If the tablet mapping info were stored only inside the master, the master would become a SPOF. Techniques such as a backup/shadow master could improve availability, but something still has to decide when to switch over.
An outside service could monitor the health of the master; however, the network connection between that outside service and the master cannot itself be guaranteed, so it might wrongly declare the master dead.
Chubby is that outside service. It runs on five servers and uses a consensus algorithm (Paxos) to guarantee that there is no such false positive.
Horizontal scalability
Three-layer tablet hierarchy
Root tablet
BigTable stores the root tablet's location in a fixed, never-changing place (a Chubby file). The root tablet is the first tablet of the metadata table and is never split further.
Metadata table
The metadata table stores the mapping from tablets to their locations. It is similar to the information_schema tables in MySQL.
User tablet
Stores the user-created data itself
Flowchart
Example: all order data lives in the ECOMMERCE_ORDERS table, and a client looks up order ID A20210101RST (a sketch of this lookup follows).
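A rough sketch of the three-level lookup for the example above; the dictionaries and server addresses are hypothetical placeholders, and in the real system each hop is an RPC whose result the client caches.

```python
import bisect

# Hypothetical location data standing in for the three-level hierarchy.
# Keys in the metadata table are (table name, end row key) of a user tablet.

# Root tablet: maps metadata-row ranges -> which metadata tablet holds them.
ROOT_TABLET = [
    (("ECOMMERCE_ORDERS", "M"), "metadata-tablet-1"),
    (("ECOMMERCE_ORDERS", "~"), "metadata-tablet-2"),
]

# Metadata tablets: map (table, end row) -> tablet server holding that user tablet.
METADATA_TABLETS = {
    "metadata-tablet-1": [
        (("ECOMMERCE_ORDERS", "A9999999999"), "tabletserver-3:9000"),
        (("ECOMMERCE_ORDERS", "M"), "tabletserver-5:9000"),
    ],
    "metadata-tablet-2": [
        (("ECOMMERCE_ORDERS", "~"), "tabletserver-8:9000"),
    ],
}

def find_range(entries, key):
    """Return the value of the first entry whose end key is >= key."""
    ends = [end for end, _ in entries]
    i = bisect.bisect_left(ends, key)
    return entries[i][1]

def locate(table, row_key):
    """Walk root tablet -> metadata tablet -> tablet server address."""
    meta_tablet = find_range(ROOT_TABLET, (table, row_key))
    return find_range(METADATA_TABLETS[meta_tablet], (table, row_key))

# Which tablet server serves order A20210101RST?
print(locate("ECOMMERCE_ORDERS", "A20210101RST"))  # tabletserver-3:9000
```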
Benefits
Large storage capacity
The three-layer storage architecture makes it easy for BigTable to scale.
Each record in the metadata table is about 1 KB.
The upper limit for a metadata tablet is 128 MB, so one metadata tablet can point to roughly 128 MB / 1 KB = 2^17 tablets.
Since the root tablet points to up to 2^17 metadata tablets, each of which points to up to 2^17 user tablets, the three-layer hierarchy can address about (2^17)^2 = 2^34 user tablets.
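A quick check of the arithmetic above (1 KB per metadata row, 128 MB per tablet):

```python
# Rows one metadata tablet can hold, squared for the two levels below the root.
rows_per_metadata_tablet = (128 * 1024 * 1024) // 1024    # 131072 == 2**17
addressable_user_tablets = rows_per_metadata_tablet ** 2  # 2**34
print(addressable_user_tablets == 2**34)                  # True
```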
Evenly distributed load for metadata
When looking up where a tablet lives, the lookups are spread evenly across the metadata tablets.
The root tablet, which every lookup starts from, never moves, so its location can be cached by clients.
The entire access path never needs to go through the master.
High throughput design
Flow chart
Memtable
Supports three operations (sketched below):
Random read by row key
Random write by row key
Ordered traversal by row key
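A minimal memtable sketch supporting the three operations; real implementations use a skip list, but a bisect-maintained sorted list shows the same interface under that assumption.

```python
import bisect

class Memtable:
    """In-memory sorted structure (a skip list in real systems; a sorted
    list here for brevity) supporting read, write, and ordered traversal."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._data = {}   # row key -> value

    def put(self, key, value):
        """Random write by row key."""
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        """Random read by row key."""
        return self._data.get(key)

    def scan(self, start_key=None, end_key=None):
        """Ordered traversal by row key over [start_key, end_key)."""
        lo = 0 if start_key is None else bisect.bisect_left(self._keys, start_key)
        hi = len(self._keys) if end_key is None else bisect.bisect_left(self._keys, end_key)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]

mt = Memtable()
mt.put("A2021", "order-1")
mt.put("A2020", "order-0")
print(mt.get("A2020"))   # order-0
print(list(mt.scan()))   # [('A2020', 'order-0'), ('A2021', 'order-1')]
```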
High throughput write process
Record the write operation in the write-ahead log (WAL).
The write then goes directly to the memtable (an in-memory sorted list).
When the in-memory skip list reaches its maximum capacity, flush it to disk as an SSTable and, at the same time, build an index and a bloom filter for it (see the sketch below).
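A hedged sketch of this write path under simplified assumptions: JSON files stand in for SSTables, the directory and class names are made up, and the index/bloom-filter construction is left to the later sketches.

```python
import json
import os

MEMTABLE_LIMIT = 4  # tiny threshold for illustration; real limits are in MBs

class WritePath:
    """Sketch: append to the WAL, apply to the memtable, flush to an SSTable."""

    def __init__(self, data_dir="bigtable-demo"):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")
        self.memtable = {}   # row key -> value, sorted only on flush
        self.flushed = 0

    def put(self, key, value):
        # 1. Record the mutation in the write-ahead log first.
        self.wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self.wal.flush()
        # 2. Apply it to the in-memory table only: no random disk I/O.
        self.memtable[key] = value
        # 3. Flush once the memtable reaches its size limit.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        """Write the memtable out as an immutable, key-sorted SSTable file."""
        path = os.path.join(self.data_dir, f"sstable-{self.flushed}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.flushed += 1
        self.memtable.clear()   # the real system also starts a new WAL segment

db = WritePath()
for i in range(5):
    db.put(f"A2021010{i}", f"order-{i}")
```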
Write: how to save disk space. Repeated writes of the same key leave duplicate (key, value) entries across SSTables and consume too much disk space.
Have a background process regularly do a k-way merge of the sorted tables,
then write the result out as a new table/file (see the sketch below).
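A sketch of such a background merge, assuming small in-memory lists stand in for SSTables; duplicate keys are resolved in favor of the newest table.

```python
import heapq

# Hypothetical SSTables, oldest to newest: each is a list of (key, value)
# pairs sorted by key. The same key may appear in several tables.
sstables = [
    [("A2020", "CREATED"), ("B1000", "PAID")],      # oldest
    [("A2020", "PAID"),    ("C3000", "CREATED")],
    [("A2020", "SHIPPED"), ("B1000", "REFUNDED")],  # newest
]

def compact(tables):
    """K-way merge of sorted tables into one new sorted table.
    For duplicate keys, keep only the value from the newest table."""
    # Tag every entry with the table's age so newer entries sort first per key.
    tagged = [
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(tables)
    ]
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            continue   # a newer table's value for this key was already kept
        merged.append((key, value))
    return merged

print(compact(sstables))
# [('A2020', 'SHIPPED'), ('B1000', 'REFUNDED'), ('C3000', 'CREATED')]
```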
Durability with WAL
What if memory is lost?
Problem: the in-memory table (which would have become the Nth SSTable) is lost.
Write-ahead log (WAL): the WAL is the lifeline needed when disaster strikes. Similar to the binlog in MySQL, it records all changes to the data, which matters if something happens to the primary storage: if the server crashes, it can replay the log to bring itself back to where it was just before the crash. It also means that if writing a record to the WAL fails, the whole operation must be considered a failure. There is a balance to strike between latency and durability.
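A small sketch of the append/replay cycle; the file path and record format are hypothetical, and the fsync on every write illustrates the strict end of the latency/durability trade-off.

```python
import json
import os

WAL_PATH = "wal.log"  # hypothetical location

def append(entry):
    """Append one mutation and force it to disk before acknowledging."""
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps(entry) + "\n")
        wal.flush()
        os.fsync(wal.fileno())  # stricter durability, higher write latency

def replay():
    """After a crash, rebuild the lost memtable by re-applying every record."""
    memtable = {}
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as wal:
            for line in wal:
                record = json.loads(line)
                memtable[record["key"]] = record["value"]
    return memtable

append({"key": "A20210101RST", "value": "PAID"})
print(replay())  # {'A20210101RST': 'PAID'}
```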
High throughput read
Process (sketched below):
Check the key in the in-memory skip list (memtable) first.
Check each SSTable's bloom filter to decide which files might contain the key.
Use that file's index to find where the key is stored.
Read and return the (key, value) pair.
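A sketch of this read path, with plain Python sets and dicts standing in for the bloom filters, indexes, and SSTable blocks.

```python
# Memtable holds the newest data; SSTables are listed newest first.
memtable = {"B1000": "PAID"}

# Each SSTable carries a set standing in for its bloom filter (a real bloom
# filter may give false positives but never false negatives) and a dict
# standing in for its block index plus data.
sstables = [
    {"bloom": {"A2020", "C3000"}, "data": {"A2020": "SHIPPED", "C3000": "CREATED"}},
    {"bloom": {"A2020"},          "data": {"A2020": "CREATED"}},
]

def read(key):
    # 1. Check the in-memory table first: it has the newest data.
    if key in memtable:
        return memtable[key]
    # 2. Check each on-disk table from newest to oldest.
    for sstable in sstables:
        # 2a. Skip the file entirely if its bloom filter says "definitely absent".
        if key not in sstable["bloom"]:
            continue
        # 2b. Otherwise use the index to locate and read the value.
        if key in sstable["data"]:
            return sstable["data"][key]
    return None

print(read("A2020"))  # SHIPPED (from the newest SSTable that contains it)
print(read("Z9999"))  # None
```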
To serve reads, it needs:
Index / bloom filter / cache
Index
Each sorted table (SSTable) keeps an index in memory. The index is a sparse sketch of keys and their offsets in the file.
A more advanced way is to build the index as a B-tree.
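A sketch of such a sparse index (sampled keys plus made-up byte offsets) searched with binary search; a B-tree would serve the same role for larger indexes.

```python
import bisect

# A sparse in-memory index over one SSTable: sample every Nth key and record
# the byte offset of the block that starts with it. Offsets are illustrative.
index_keys    = ["A2020", "C3000", "F1000", "K5000"]   # sampled keys
index_offsets = [0,       4096,    8192,    12288]     # block byte offsets

def block_offset(key):
    """Return the offset of the block whose key range may contain `key`."""
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None   # key sorts before the first indexed key
    return index_offsets[i]

print(block_offset("D7777"))   # 4096: read the block starting at that offset
```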
Bloom filter
Each sorted table (SSTable) keeps a bloom filter in memory so a read can skip files that definitely do not contain the key.
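A tiny bloom filter sketch (a bit array plus k hash functions); the parameters are illustrative, not what BigTable actually uses.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash functions over an m-bit array. It can
    answer "definitely not present" or "possibly present" (false positives
    only, never false negatives)."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("A20210101RST")
print(bf.might_contain("A20210101RST"))  # True
print(bf.might_contain("Z9999"))         # almost certainly False
```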
A tablet server also uses two levels of caching:
Scan cache: a higher-level cache for the key/value pairs returned by the SSTable interface, useful when the same data is read repeatedly.
Block cache: a lower-level cache for SSTable blocks read from GFS, useful when reads touch data near recently read data.
High throughput
GFS does not provide any consistency guarantee for random writes.
BigTable does provide a consistency guarantee for random writes, because it turns them into sequential WAL appends plus in-memory updates.
Pros
Optimized for writes: a write only touches the WAL (a sequential append) and the in-memory sorted list.
Cons
In the worst case, a read has to go through a chain of structures (the in-memory table, then on-disk SSTable N, ..., down to SSTable 1).
Compaction helps reduce this problem by shrinking the number of SSTables.
References