Last updated
Last updated
Write a file
Read a file
Use multiple machines to store these files
Metadata
FileInfo
Name = dengchao.mp4
CreatedTime = 201505031232
Size = 2044323
Index
Block 11 -> diskOffset1
Block 12 -> diskOffset2
Block 13 -> diskOffset3
Block
1 block = 1024 Byte
Advantages
Error checking
Fragmenting the data for storage
Change chunk size
1 chunk = 64M = 64 * 1024K
Advantages
Reduce size of metadata
Disadvantages
Waste space for small files
Peer 2 Peer (BitComet, Cassandra)
Advantage: No single point of failure
Disadvantage: Multiple machines need to negotiate with each other
Master slave
Advantage: Simple design. Easy to keep data consistent
Disadvantage: Master is a single point of failure
Final decision
Master + slave
Restart the single master
One master + many chunk servers
Master don't record the disk offset of a chunk
Advantage: Reduce the size of metadata in master; Reduce the traffic between master and chunk server
The client divides the file into chunks. Create a chunk index for each chunk
Send (FileName, chunk index) to master and master replies with assigned chunk servers
The client transfer data with the assigned chunk server.
The client sends (FileName) to master and receives a chunk list (chunk index, chunk server) from the master
The client connects with different server for reading files
Store metadata for different files
Store Map (file name + chunk index -> chunk server)
Find corresponding server when reading in data
Write to more available chunk server
Double master (Apache Hadoop Goes Realtime at Facebook)
Multi master (Paxos algorithm)
Check sum 4bytes = 32 bit
Each chunk has a checksum
Write checksum when writing out a chunk
Check checsum when reading in a chunk
Replica: 3 copies
Two copies in the same data center but on different racks
Third copy in a different data center
How to choose chunk servers
Find servers which are not busy
Find servers with lots of available disk space
Ask master for help
Heart beat message
Client only writes to a leader chunk server. The leader chunk server is responsible for communicating with other chunk servers.
How to select leading slaves
Ask the client to retry