Scaling up typically requires doubling the number of machines; otherwise, a lot of data has to move across partitions to keep the data evenly distributed (see the sketch after this list).
The sharding strategy is not transparent to developers: application code has to know how the data is partitioned.
If sharding by time, should it be by year, month, or day? This year's volume might be 10X last year's; sharding by month or day also faces the challenge of promotion seasons such as Black Friday.
Within a MySQL cluster, each server can have a backup. However, the failover typically requires manual intervention when, for example, a hard drive fails.
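A minimal Python sketch (not from the original text) of why modulo-based sharding favors doubling the machine count: when the shard count doubles, each key either stays put or moves to one predictable sibling shard, whereas growing from 4 to 5 shards scatters most keys.

```python
# Sketch: compare data movement when resharding 4 -> 5 vs 4 -> 8 shards.
from collections import Counter

def shard(key: str, num_shards: int) -> int:
    return hash(key) % num_shards

keys = [f"order-{i}" for i in range(10_000)]

moved_to_5, moved_to_8 = Counter(), Counter()
for k in keys:
    old = shard(k, 4)
    if shard(k, 5) != old:
        moved_to_5[old] += 1
    new = shard(k, 8)
    if new != old:
        # With doubling, the only possible move is old -> old + 4 (a clean split).
        assert new == old + 4
        moved_to_8[old] += 1

print("keys moved when 4 -> 5 shards:", sum(moved_to_5.values()))  # roughly 80%
print("keys moved when 4 -> 8 shards:", sum(moved_to_8.values()))  # roughly 50%, per-shard splits
```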
Elasticity: add or remove servers at will
Dynamic load shedding: adjust the load on a single node
Fault tolerance: a minority of machines going down won't impact availability.
Columns within the same column family are stored together.
A single value can be stored with multiple timestamped versions.
BigTable will dynamically allocate data to different partitions.
BigTable uses Master + Chubby to manage the partition information.
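A minimal sketch of this data model as a sorted, multi-dimensional map: (row key, column family:qualifier, timestamp) -> value. The row and column names below are made up for illustration.

```python
# Sketch: column qualifiers live under a family, and each cell keeps versions.
import time
from collections import defaultdict

table = defaultdict(lambda: defaultdict(dict))  # row -> "family:qualifier" -> {timestamp: value}

def put(row, column, value, ts=None):
    table[row][column][ts or time.time_ns()] = value

def get(row, column, num_versions=1):
    versions = table[row][column]
    # Newest versions first, mirroring timestamp ordering.
    return sorted(versions.items(), reverse=True)[:num_versions]

put("com.example.www", "anchor:referring_site", "CNN")
put("com.example.www", "anchor:referring_site", "CNN Sports")
print(get("com.example.www", "anchor:referring_site", num_versions=2))
```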
Provides online data read/write service
Note: the tablet server is not responsible for storing the actual data; the data itself lives in GFS.
Assign tablets to tablet servers
Detect the addition and expiration of tablet servers
Balance load of tablet servers
Garbage collect on data stored in GFS
Guarantee that there is only one master
Store the bootstrap location of BigTable data
Discover tablet servers and clean up after their termination
Store access control list
If the tablet mapping info were stored inside the master, the master would become a single point of failure (SPOF). Techniques such as a backup/shadow master could improve availability.
An outside service could be used to monitor the health of the master. However, how do we guarantee the network connection between that outside service and the master?
Chubby is that outside service; it runs on five servers and uses a consensus algorithm (Paxos) to guarantee that there are no false positives.
BigTable stores the root tablet location in a fixed, never-changing place (a Chubby file). The root tablet is the first tablet of the METADATA table and is never split further.
The METADATA table stores the mapping from row ranges to tablets. It is similar to the information_schema tables in MySQL.
User tablet
Stores the actual user-created data
Example: all data is stored in the ECOMMERCE_ORDERS table, and we look up order ID A20210101RST (see the sketch below).
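A minimal, self-contained sketch of the three-level lookup for this example. The tablet-server addresses and tablet contents are invented stand-ins; the real METADATA encoding is more involved.

```python
# Sketch: Chubby file -> root tablet -> METADATA tablet -> user tablet.
import bisect

# Chubby stores only the fixed location of the root tablet (it never changes).
CHUBBY_ROOT_LOCATION = "tabletserver-1:/root-tablet"

# Root tablet: points to the METADATA tablets (here just one).
ROOT_TABLET = {CHUBBY_ROOT_LOCATION: ["tabletserver-2:/meta-tablet-0"]}

# METADATA tablets: sorted (table, end_row_key) -> user tablet location.
META_TABLETS = {
    "tabletserver-2:/meta-tablet-0": [
        (("ECOMMERCE_ORDERS", "M"), "tabletserver-3:/orders-tablet-0"),
        (("ECOMMERCE_ORDERS", "zzzz"), "tabletserver-4:/orders-tablet-1"),
    ],
}

location_cache = {}  # client-side cache; note the master is never on this path

def locate(table, row_key):
    if (table, row_key) in location_cache:
        return location_cache[(table, row_key)]
    meta_location = ROOT_TABLET[CHUBBY_ROOT_LOCATION][0]   # Chubby -> root -> METADATA tablet
    entries = META_TABLETS[meta_location]
    idx = bisect.bisect_left([k for k, _ in entries], (table, row_key))
    user_tablet = entries[idx][1]                          # METADATA -> tablet holding the row
    location_cache[(table, row_key)] = user_tablet
    return user_tablet

print(locate("ECOMMERCE_ORDERS", "A20210101RST"))          # -> tabletserver-3:/orders-tablet-0
```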
The three-layer architecture makes it easy for BigTable to scale.
Each record in the METADATA table is roughly 1KB.
Each METADATA tablet is capped at 128MB, so one METADATA tablet can point to 128MB / 1KB = 2^17 tablets.
The three-layer hierarchy can therefore address (2^17)^2 = 2^34 user tablets.
When looking up where a tablet lives, the load is evenly distributed across METADATA tablets.
The root tablet, which everyone needs to consult, never changes location and can be cached by clients.
The entire access path never goes through the master.
Supports three operations:
Random read by row key
Random write by row key
Ordered traversal (scan) by row key (see the sketch after this list)
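A minimal sketch of these three row-key operations, backed here by a plain sorted key list instead of a real memtable/SSTable stack.

```python
# Sketch: random read, random write, and ordered scan by row key.
import bisect

class Tablet:
    def __init__(self):
        self._keys = []   # kept sorted so ordered scans are cheap
        self._rows = {}

    def write(self, row_key, value):           # random write by row key
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def read(self, row_key):                   # random read by row key
        return self._rows.get(row_key)

    def scan(self, start_key, end_key):        # ordered traversal by row key
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_right(self._keys, end_key)
        for k in self._keys[lo:hi]:
            yield k, self._rows[k]

t = Tablet()
t.write("A2021", "order-1"); t.write("B2021", "order-2")
print(t.read("A2021"), list(t.scan("A", "B9999")))
```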
Record the write operation in the write-ahead log (WAL).
The write then goes to the memtable (an in-memory sorted skip list).
When the in-memory skip list reaches its maximum size, write it to disk as an SSTable (it is already sorted), and at the same time create an index and a Bloom filter for it (see the sketch after this list).
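A minimal sketch of this write path, assuming a line-per-record JSON WAL format and a tiny memtable limit for illustration; a plain dict sorted at flush time stands in for the skip list.

```python
# Sketch: append to WAL first, update memtable, flush to an "SSTable" when full.
import json, os

MEMTABLE_LIMIT = 4
memtable, wal, sstables = {}, open("wal.log", "a"), []

def write(key, value):
    wal.write(json.dumps({"k": key, "v": value}) + "\n")
    wal.flush()
    os.fsync(wal.fileno())          # make the record durable before acknowledging
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush_memtable()

def flush_memtable():
    path = f"sstable-{len(sstables)}.json"
    with open(path, "w") as f:      # persist keys in sorted order
        json.dump(sorted(memtable.items()), f)
    sstables.append(path)
    memtable.clear()                # a real system would also checkpoint/truncate the WAL

for i in range(5):
    write(f"key-{i}", f"value-{i}")
```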
Write: how to save disk space? Repeated (key, value) entries for the same key consume too much disk space.
A background process regularly performs a k-way merge of the sorted tables (see the merge sketch below),
then writes the result out as a new table/file.
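A minimal sketch of such a compaction step: heapq.merge performs the k-way merge and only the newest value per key is kept. Later runs are assumed to be newer.

```python
# Sketch: k-way merge of sorted runs, dropping stale duplicate keys.
import heapq

def compact(sorted_runs):
    # Tag each entry with the run's age so the newest value sorts first per key.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(sorted_runs)
    )
    merged, last_key = [], object()
    for key, _neg_age, value in heapq.merge(*tagged):
        if key != last_key:          # first occurrence is the newest version
            merged.append((key, value))
            last_key = key
    return merged

old = [("a", 1), ("b", 1)]
new = [("a", 2), ("c", 3)]
print(compact([old, new]))           # [('a', 2), ('b', 1), ('c', 3)]
```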
Durability with WAL
What if memory is lost?
Problem: the current in-memory table is lost before it has been flushed to disk.
Write-ahead log (WAL): the WAL is the lifeline needed when disaster strikes. Similar to the binlog in MySQL, it records all changes to the data. This matters in case something happens to the primary storage: if the server crashes, it can replay the log to bring itself back to the state it was in just before the crash. It also means that if writing a record to the WAL fails, the whole operation must be considered a failure. There is a balance to strike between latency and durability (see the recovery sketch below).
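A minimal recovery sketch that rebuilds the memtable by replaying the WAL, assuming the same line-per-record JSON format as the write-path sketch above; a real system would checkpoint and replay only records written after the last flush.

```python
# Sketch: replay the WAL after a crash to reconstruct the lost memtable.
import json, os

def replay_wal(path="wal.log"):
    memtable = {}
    if not os.path.exists(path):
        return memtable
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)      # a corrupt tail record would be skipped in practice
            memtable[record["k"]] = record["v"]
    return memtable

print(replay_wal())
```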
Process
First check the key in the in-memory skip list.
Check the Bloom filter of each SSTable file to decide which files might contain the key.
Use that file's index to find the location of the key.
Read and return the key-value pair (see the sketch after this list).
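A minimal, self-contained sketch of this read path; Bloom filters are modeled as plain sets and SSTables as in-memory lists standing in for files on disk.

```python
# Sketch: memtable first, then newest-to-oldest SSTables via Bloom filter + index.
def read(key, memtable, sstables):
    if key in memtable:                                   # 1. in-memory skip list
        return memtable[key]
    for table in reversed(sstables):                      # newest SSTable first
        if key not in table["bloom"]:                     # 2. definitely not in this file
            continue
        offset = table["index"].get(key)                  # 3. index -> offset in the file
        if offset is not None:
            return table["data"][offset]                  # 4. one read from "disk"
    return None

sstables = [
    {"bloom": {"a", "b"}, "index": {"a": 0, "b": 1}, "data": ["old-a", "old-b"]},
    {"bloom": {"a", "c"}, "index": {"a": 0, "c": 1}, "data": ["new-a", "new-c"]},
]
print(read("a", memtable={}, sstables=sstables))          # -> "new-a" (newest wins)
```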
Serving reads relies on three in-memory structures:
Index / Bloom filter / cache
Index
Each sorted table (SSTable) keeps an index in memory. The index is a sparse sketch of the key-value pairs: it records only some keys and their offsets in the file.
A more advanced approach is to build the index as a B-tree (a sparse-index sketch follows below).
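A minimal sketch of such a sparse index: only every Nth key of the sorted file is indexed, so a lookup binary-searches the index and then scans a single small block.

```python
# Sketch: sparse in-memory index over a sorted "SSTable" list.
import bisect

SPARSITY = 3
sstable = sorted((f"key-{i:02d}", f"value-{i}") for i in range(10))
index = [(k, pos) for pos, (k, _) in enumerate(sstable) if pos % SPARSITY == 0]

def get(key):
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key) - 1            # last indexed key <= key
    if i < 0:
        return None
    start = index[i][1]
    for k, v in sstable[start:start + SPARSITY]:      # scan one small block
        if k == key:
            return v
    return None

print(get("key-07"))                                   # -> "value-7"
```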
Bloom filter
Each sorted table keeps a Bloom filter in memory, so reads can skip files that definitely do not contain the key (see the sketch below).
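A minimal Bloom filter sketch (the bit-array size and hash count are arbitrary choices for illustration): a miss is definitive, so the on-disk SSTable can be skipped; a hit may still be a false positive.

```python
# Sketch: k hash functions set k bits per key; membership checks all k bits.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-123")
print(bf.might_contain("row-123"), bf.might_contain("row-999"))  # True, (almost surely) False
```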
On top of the SSTables, there are two levels of caches (see the sketch after this list):
Scan cache: a higher-level cache of the key-value pairs returned by the SSTable interface; useful when the same data is read repeatedly.
Block cache: a lower-level cache of SSTable blocks read from GFS; useful when reads touch data close to what was recently read.
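A minimal sketch of these two cache levels, assuming simple LRU eviction; the capacities and key shapes are arbitrary.

```python
# Sketch: one LRU for returned key-value pairs, one LRU for raw SSTable blocks.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.items = capacity, OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)      # evict the least recently used entry

scan_cache = LRUCache(capacity=10_000)   # caches (row, column) -> value
block_cache = LRUCache(capacity=1_000)   # caches (sstable, block_no) -> bytes
```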
GFS does not have any consistency guarantee for random write.
Bigtable has consistency guarantee for random write.
Optimized for writes: a write only appends to the WAL and updates the in-memory sorted list.
In the worst case, a read has to go through a chain of structures (the in-memory table, then on-disk SSTable N, ..., SSTable 1).
Compaction helps reduce this read amplification.