Data flow


Identify the bottleneck

  • In the GFS case:

    • Servers use 100 Mbps network cards, which give a maximum throughput of 12.5 MB/s.

    • A 5400 rpm disk typically delivers 60~90 MB/s, and with multiple disks plugged into one server the aggregate disk bandwidth can reach roughly 500 MB/s.

    • So the bottleneck is the network layer, not the disks (see the arithmetic sketch below).
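A quick back-of-the-envelope check of those numbers; the per-disk figure and disk count are illustrative assumptions, not GFS specifics:

```python
# Rough bandwidth comparison for a GFS-era chunk server.
NIC_MBPS = 100                       # 100 Mbps network card
nic_mb_per_s = NIC_MBPS / 8          # bits -> bytes: 12.5 MB/s

disk_mb_per_s = 75                   # one 5400 rpm disk, ~60-90 MB/s (assumed midpoint)
disks_per_server = 7                 # assumed disk count, enough to reach ~500 MB/s
total_disk_mb_per_s = disk_mb_per_s * disks_per_server

print(f"network: {nic_mb_per_s:.1f} MB/s, disks: {total_disk_mb_per_s} MB/s")
# network (12.5 MB/s) << disks (~525 MB/s) => the network is the bottleneck
```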

Master metadata

  • File and chunk namespace

  • Mapping from a file's full path name to its list of chunk handles.

  • Mapping from each chunk handle to the list of chunk servers holding its replicas (a sketch of both mappings follows this list).
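A minimal sketch of those two mappings as in-memory structures; the field names, handles, and addresses are illustrative, not the actual GFS data layout:

```python
# namespace: full file path -> ordered list of chunk handles
file_to_chunks: dict[str, list[int]] = {
    "/data/crawl/part-00001": [0x1A2B, 0x1A2C, 0x1A2D],
}

# locations: chunk handle -> chunk servers holding a replica
chunk_to_servers: dict[int, list[str]] = {
    0x1A2B: ["cs-rack1-01:7000", "cs-rack1-02:7000", "cs-rack2-05:7000"],
}

def locate(path: str, chunk_index: int) -> list[str]:
    """What the master does on a lookup: path + chunk index -> replica addresses."""
    handle = file_to_chunks[path][chunk_index]
    return chunk_to_servers[handle]
```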

Chunk server data

Chunk size of 64MB

  • Like a regular file system, GFS identifies a file by its namespace plus file name.

  • Each file is divided into chunks of 64 MB.

  • Because chunk sizes are fixed, the GFS client can compute which chunk a byte offset falls into, and then learn from the master which chunk servers hold it (see the sketch after this list).
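With a fixed chunk size, locating a chunk is pure integer arithmetic, for example:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def chunk_index(byte_offset: int) -> int:
    """Which chunk a given byte offset falls into."""
    return byte_offset // CHUNK_SIZE

def chunk_count(file_size: int) -> int:
    """How many chunks a file of this size occupies (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

assert chunk_index(150 * 1024 * 1024) == 2   # offset 150 MB -> third chunk
assert chunk_count(200 * 1024 * 1024) == 4   # a 200 MB file spans 4 chunks
```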

Read flowchart

  1. The GFS client sends the file name and a chunk index to the GFS master. Since all chunks have the same size of 64 MB, the chunk index is easily computed from the byte offset.

  2. The master looks up its metadata and returns the addresses of the chunk servers holding that chunk's replicas.

  3. The client can then reach out to any of those replicas to fetch the chunk data (a rough client-side sketch follows this list).
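Putting the three steps together, the client-side read path looks roughly like this; `master.find_chunk` and `server.read_chunk` are hypothetical RPC stubs, not the real GFS client API:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, path: str, offset: int, length: int) -> bytes:
    """Read path sketch; assumes the read stays within a single chunk."""
    index = offset // CHUNK_SIZE                       # step 1: compute chunk index
    handle, replicas = master.find_chunk(path, index)  # steps 1-2: one RPC to the master
    server = random.choice(replicas)                   # step 3: any replica can serve the read
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)
```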

Write flowchart

Separation of control signal and data flow

  • The master only tells the GFS client which chunk servers to read from or write to; after that, it is out of the picture. Data then flows directly between the client and the chunk servers.

  1. The client asks the master for the locations of the chunk servers holding the target chunk.

  2. The master replies with the locations of the primary and the secondary replicas.

  3. The client pushes the data to all replicas, starting with the nearest one. The replicas do not immediately write the data to disk; instead, they buffer it in memory.

  4. Once all replicas have received the data, the client sends a write request to the primary replica, which assigns a serial order to all pending write requests.

  5. The primary forwards the write requests to the secondary replicas, and the secondaries apply the writes to disk in that same order.

  6. When the secondaries finish writing, they acknowledge the primary.

  7. The primary then tells the client that the write has completed successfully (the sketch below puts these steps together).
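A sketch of the whole write path with the control and data planes separated; all RPC stubs (`find_chunk_for_write`, `push_data`, `commit`) and the proximity metric are hypothetical:

```python
def gfs_write(master, client_location, path: str, index: int, data: bytes) -> None:
    # Steps 1-2 (control plane): one round trip to the master.
    handle, primary, secondaries = master.find_chunk_for_write(path, index)

    # Step 3 (data plane): push data along a chain that starts at the nearest
    # replica; each replica buffers it in memory, nothing hits disk yet.
    chain = sorted([primary] + secondaries,
                   key=lambda replica: replica.distance_to(client_location))
    chain[0].push_data(handle, data, forward_to=chain[1:])

    # Steps 4-7 (control plane): the primary picks a serial order, forwards
    # the writes to the secondaries, collects their acks, then acks the client.
    primary.commit(handle)
```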

Closest replica first

  • Data is not necessarily pushed to the primary replica first; the client starts with whichever replica is closest to it.

  • That replica then forwards the data to the next closest replica, forming a pipeline.

Reason

  • All servers on the same rack are plugged into the same access switch.

  • All access switches connect to an aggregation switch.

  • Aggregation switches in turn connect to a core switch.

  • Pushing data to the nearest replica first keeps as much traffic as possible within a rack, so the heavily shared aggregation and core switches are not saturated (a toy distance heuristic follows).
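Under that tree topology, replica "closeness" can be approximated by how high in the tree the traffic must climb. A toy heuristic, illustrative only and not GFS's actual distance function:

```python
from dataclasses import dataclass

@dataclass
class Server:
    rack: str          # which access switch the server hangs off
    aggregation: str   # which aggregation switch that rack connects to

def hops(a: Server, b: Server) -> int:
    """Rough hop count between two servers in the tree topology above."""
    if a.rack == b.rack:
        return 2   # both plugged into the same access switch
    if a.aggregation == b.aggregation:
        return 4   # up through the shared aggregation switch and back down
    return 6       # all the way up to the core switch

# The client pushes data to min(replicas, key=lambda r: hops(client, r)) first,
# keeping as much traffic as possible below the aggregation/core layers.
```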
