Data flow
Identify the bottleneck
In GFS case
It uses 100Mbps network card, it has a maximum throughput of 12.5MB/s.
It uses 5400 rpm disk, its bandwidth is usually 60~90MB/s. And multiple hard disks could be plugged together, and there will be a maximum bandwidth of 500MB/s.
The bottleneck is in network layer.
Master metadata
File and chunk namespace
Mapping from file full path name to chunk handle list.
Mapping from chunk handle list to chunk server list.
Chunk server data
Chunk size of 64MB
GFS also uses namespace + filename to identify a file.
Each file is divided to chunk of size 64MB.
GFS client knows which chunk server to find the file.
Read flowchart
GFS client sends file name and chunk offset to GFS master. Since all chunks have the same size of 64MB, chunk index could be easily calculated.
After GFS master gets the request, it will return addresses of chunk servers to clients.
After GFS client gets addresses, it could reach out to any of it to get chunk data.
Write flowchart
Separation of control signal and data flow
Master only tells GFS client which chunk servers to read/write data. After that it is out of the business.
Client queries master for locations of chunk servers.
Master replies with primary and secondary replica locations of chunk servers.
Client sends data to all replicas (by picking the nearest replica first). However, after secondary replicas receive the data, they will not immediately write it to disk. Instead, they will cache it in the memory.
After all secondary replicas receive data, clients will send a write request to primary replica. Primary replica will order all the write requests.
Primary replica will forward all write requests to secondary replicas. Then all secondary replicas will write data to disk with the same order.
After secondary replicas finish writing, they will reply to primary replica that they have finished.
Primary replica will tell clients that write requests have completed successfully.
Closest replica first
Data might not first be transmitted to primary replica. It depends on which replica is closer to the client.
Then the closer replica will send the data to the next replica.
Reason
All servers on the same rack will be plugged to the same access switch.
All switches on the rack will connect to a single aggregate switch.
Aggregate switches will connect to core switch.
Last updated