# High availability

* [Disk fault tolerance](#disk-fault-tolerance)
* [ChunkServer/DataNode high availability](#chunkserverdatanode-high-availability)
* [Master/NameNode high availability](#masternamenode-high-availability)
  * [First layer defense: Restart master](#first-layer-defense-restart-master)
  * [Second layer defense: Master backup](#second-layer-defense-master-backup)
  * [Third layer defense: Shadow backup](#third-layer-defense-shadow-backup)
    * [Motivation](#motivation)
    * [Inconsistency](#inconsistency)

## Disk fault tolerance

## ChunkServer/DataNode high availability

## Master/NameNode high availability

![](https://1010073591-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Mk8dv8Mfudl_6ziUzDf%2Fuploads%2Fgit-blob-bc3e4fa179d225d06e2d2a1c8d10f19ba5daf37d%2Fmaster_high_availability.png?alt=media)

![](https://1010073591-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Mk8dv8Mfudl_6ziUzDf%2Fuploads%2Fgit-blob-fb59ac6b7da26325abeb4382a38314d0a11fa0df%2Fmaster_high_availability_timeSeries.png?alt=media)

### First layer defense: Restart master

* All master metadata is cached in memory. Periodically, a checkpoint dumps the entire in-memory state to disk.
* If the master suffers a software failure, it first recovers from the latest checkpoint, then replays the operation log entries recorded after the checkpoint's timestamp.
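The checkpoint-plus-replay recovery can be sketched as follows. This is a minimal illustration, not GFS's actual implementation: the metadata layout, operation names, and timestamp fields are all hypothetical.

```python
# Sketch of checkpoint + operation-log recovery. All names and the
# metadata shape (path -> chunk handles) are illustrative assumptions.
class Master:
    def __init__(self):
        self.metadata = {}       # e.g. file path -> list of chunk handles
        self.checkpoint_ts = 0   # timestamp of the last on-disk checkpoint

    def load_checkpoint(self, checkpoint):
        """Restore the in-memory state dumped at checkpoint time."""
        self.metadata = dict(checkpoint["metadata"])
        self.checkpoint_ts = checkpoint["ts"]

    def replay(self, op_log):
        """Re-apply operation-log entries newer than the checkpoint."""
        for entry in op_log:
            if entry["ts"] <= self.checkpoint_ts:
                continue         # already reflected in the checkpoint
            if entry["op"] == "create":
                self.metadata[entry["path"]] = entry["chunks"]
            elif entry["op"] == "delete":
                self.metadata.pop(entry["path"], None)

master = Master()
master.load_checkpoint({"ts": 100, "metadata": {"/a": ["c1"]}})
master.replay([
    {"ts": 90,  "op": "create", "path": "/old", "chunks": ["c0"]},  # skipped
    {"ts": 120, "op": "create", "path": "/b",   "chunks": ["c2"]},
    {"ts": 130, "op": "delete", "path": "/a"},
])
# master.metadata is now {"/b": ["c2"]}: the stale entry before the
# checkpoint was skipped, and the post-checkpoint ops were re-applied.
```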

### Second layer defense: Master backup

* The procedure above handles software failures but not hardware failures.
* If the master suffers a hardware failure, it can fail over to the backups that the master synchronously replicates to.
* The switch from master to master backup is done by repointing the master's canonical name at the backup.
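Because clients address the master through a stable canonical name, failover only needs to repoint that name. A toy name-service sketch (the name and addresses below are made up, and a real deployment would use DNS or a coordination service):

```python
# Toy name service: clients resolve the canonical name on each connect,
# so failover is just updating one mapping. Addresses are hypothetical.
name_table = {"gfs-master": "10.0.0.1"}

def resolve(canonical_name):
    """What a client does before contacting the master."""
    return name_table[canonical_name]

def failover(canonical_name, backup_addr):
    """Repoint the canonical name at the synchronous backup."""
    name_table[canonical_name] = backup_addr

assert resolve("gfs-master") == "10.0.0.1"
failover("gfs-master", "10.0.0.2")   # master died; point name at backup
assert resolve("gfs-master") == "10.0.0.2"
```

Clients never hard-code the master's address, so they pick up the new master transparently on their next resolution.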

### Third layer defense: Shadow backup

#### Motivation

* The switch process could take seconds or minutes to complete. In the worst case it requires all of the following steps:
  1. The monitor program detects the master failure.
  2. Restarting the master fails: loading data from the disk checkpoint and replaying the operation log after its timestamp does not recover the master.
  3. The switch from master to master backup is started by changing the canonical name.
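The escalation across these steps can be sketched as a simple decision, cheapest remedy first. This is an illustrative outline only; the class and function names are hypothetical:

```python
# Sketch of the worst-case escalation: try a local restart first, and only
# fall back to a canonical-name switch if recovery in place fails.
class FailedRestart:
    """Stands in for a master whose restart + checkpoint/log replay fails."""
    def try_restart_and_recover(self):
        return False

def handle_master_failure(master, name_table, backup_addr):
    # Step 1 (detection) is assumed to have already happened in the monitor.
    # Step 2: cheapest path first -- restart and recover from checkpoint + log.
    if master.try_restart_and_recover():
        return "recovered in place"
    # Step 3: repoint the canonical name at the synchronous backup.
    name_table["gfs-master"] = backup_addr
    return "failed over to backup"

table = {"gfs-master": "10.0.0.1"}   # addresses are made up
result = handle_master_failure(FailedRestart(), table, "10.0.0.2")
# result == "failed over to backup"; table now points at 10.0.0.2
```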

#### Inconsistency

* The data in the shadow backup might be stale. But the chance that a client reads stale metadata from the shadow backup is quite slim, because it only happens when all three of these conditions are met:
  * The master is dead.
  * The metadata on the master has not been completely replicated to the shadow backup.
  * The data the client is trying to read depends on exactly that not-yet-replicated metadata.
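The conjunction of the three conditions can be made explicit with a small truth-table check (a sketch for illustration, not part of any GFS/HDFS API):

```python
from itertools import product

def stale_shadow_read(master_dead, replication_complete, reads_lagging_metadata):
    """A shadow-backup read is stale only when all three conditions hold."""
    return master_dead and not replication_complete and reads_lagging_metadata

# Of the eight combinations of the three conditions, exactly one
# produces a stale read: (dead, not replicated, reads the lagging part).
stale_cases = [combo for combo in product([True, False], repeat=3)
               if stale_shadow_read(*combo)]
# stale_cases == [(True, False, True)]
```

Since each condition is individually unlikely, requiring all three at once is what makes stale reads rare in practice.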

