Cluster
Codis
key distribution model
Resharding
Redis cluster
Redis Cluster is designed to survive failures of a few nodes in the cluster, but it is not a suitable solution for applications that require availability in the event of large net splits.
Key distribution model
HASH_SLOT = CRC16(key) mod 16384
Hashtag could force multiple keys are allocated in the same slots and it is used to implement multi-key operations in redis cluster.
Resharding
Trigger conditions
Add a new node
Remove an existing node
Rebalance the cluster among nodes
Resharding commands
CLUSTER ADDSLOTS slot1 [slot2] ... [slotN]
CLUSTER DELSLOTS slot1 [slot2] ... [slotN]
CLUSTER SETSLOT slot NODE node
CLUSTER SETSLOT slot MIGRATING node
CLUSTER SETSLOT slot IMPORTING node
SETSLOT command internals
redis-trib sends target node CLUSTER SETSLOT $slot IMPORTING $source_id so that target node is prepared to import key value pairs from slot.
On the node side, there is a bitmap
redis-trib sends source node CLUSTER SETSLOT $slot MIGRATING $target_id so that source node is prepared to migrate key value pairs to slot.
On the node side, there is a bitmap
redis-trib sends source node CLUSTER GETKEYSINSLOT $slot $count to get at most count number of key names belonging to slot.
for every key name obtained in step 3, redis-trib will send source node a MIGRATE $target_ip $target_port $key_name 0 $time_out command to migrate the slots from source to dest node.
Repeat step 3 and 4 until all key-value pairs belong to the slots have been migrated.
redis-trib sends CLUSTER SETSLOT $slot NODE $target_id which will be broadcasted to all the nodes within the cluster.
Move redirection
MOVED means that we think the hash slot is permanently served by a different node and the next queries should be tried against the specified node,
Ask redirection
ASK means to send only the next query to the specified node.
ASK semantics for client:
If ASK redirection is received, send only the query that was redirected to the specified node but continue sending subsequent queries to the old node.
Start the redirected query with the ASKING command.
Don't yet update local client tables to map hash slot 8 to B.
ASK semantics for server:
If the client has flag REDIS_ASKING and clusterStates_importing_slots_from[i] shows node is importing key value i, then node will execute the the client command once.
Gossip protocol
Frequency
Each second, randomly picks 1/10 neighbor nodes from local routing table and sends PING message to the most delayed node.
Cons: If only relying on random pick, some nodes might have significant delay.
To compenstate cons of 1, redis cluster will scan local instance table. If there is a node which has not received PONG message bigger than configured cluster-node-timeout/2, then it will immediately send PING message to the node.
Size of PING/PONG message
A single PING message ~ 12KB
clusterMsgDataGossip itself is roughly 100 Bytes.
Suppose the cluster has 1000 nodes, then it will send
100 (random neighbor nodes) + 1 (node itself) = 100 * 100 bytes = 10KB
Bitmap of 16384 length: 16386 = 2^14 bits = 2^14 / 8 bytes = 2^11 bytes = 2KB
Receiving corresponding PONG message ~ 12KB
Bandwidth consumption with example
Suppose the cluster size is 1000.
Suppose every 100ms there are 10 instances who receive timeout PONG messages.
Each second this node will send 101 PING message.
Then each second this node will consume 101 * 12KB = 1.2 MB/s bandwidth outbound
1000 nodes will consume 1000 * 1.2 MB/s = 1.2 GB/s
Data skew
Data volume skew
Big key
Should avoid storing too many items in a single key-value pair
If the big key is a collection, then would better split collection
Uneven slot
For example, if machine 1 has a higher hardware setup than machine 2, then maintainers would probably allocate more slots on machine 1.
If there are too many slots on a machine, then slots should be migrated to another machine.
Hashtag
Motivation
Redis Cluster implements a concept called hash tags that can be used in order to force certain keys to be stored in the same hash slot.Hash tags are a way to ensure that multiple keys are allocated in the same hash slot. This is used in order to implement multi-key operations in Redis Cluster.
Computation of hash slot
In order to implement hash tags, the hash slot for a key is computed in a slightly different way in certain conditions.
If the key contains a "{...}" pattern only the substring between { and } is hashed in order to obtain the hash slot.
However since it is possible that there are multiple occurrences of { or } the algorithm is well specified by the following rules:
IF the key contains a { character.
AND IF there is a } character to the right of {
AND IF there are one or more characters between the first occurrence of { and the first occurrence of }.
Then instead of hashing the key, only what is between the first occurrence of { and the following first occurrence of } is hashed.
Examples
The two keys {user1000}.following and {user1000}.followers will hash to the same hash slot since only the substring user1000 will be hashed in order to compute the hash slot.
For the key foo{}{bar} the whole key will be hashed as usually since the first occurrence of { is followed by } on the right without characters in the middle.
For the key foo{{bar}}zap the substring {bar will be hashed, because it is the substring between the first occurrence of { and the first occurrence of } on its right.
For the key foo{bar}{zap} the substring bar will be hashed, since the algorithm stops at the first valid or invalid (without bytes inside) match of { and }.
What follows from the algorithm is that if the key starts with {}, it is guaranteed to be hashed as a whole. This is useful when using binary data as key names.
Use case and avoid data skew
Hashtag is typically used inside range queries / Transactions
There are other ways to solve range queires / transactions problems.
Request volume skew
Hot data
For each hot data, adds multiple replica 1-N. For each replica slot, makes sure its copy is randomly assigned to a slot.
For example, original key is "abc", then randomly assigned slots could be "abc1", "abc2", ...... "abc7"
Each time when a client wants to visit the node, add the random suffix to the key.
Only suitable for read-only data; For write operations, need to find a way to keep consistency.
High availability
Similar to how Redis Sentinel marks a node from subjective down state to objective down state. Please see redis sentinel high availability
Redis cluster node will marks a node from PFAIL to FAIL.
Step0: Health detection - PFAIL to FAIL state
PFAIL: PFAIL means Possible failure, and is a non-acknowledged failure type. A node flags another node with the PFAIL flag when the node is not reachable for more than NODE_TIMEOUT time. Both master and slave nodes can flag another node as PFAIL, regardless of its type.
FAIL: FAIL means that a node is failing and that this condition was confirmed by a majority of masters within a fixed amount of time.
PFAIL => FAIL: A PFAIL condition is escalated to a FAIL condition when the following set of conditions are met:
Some node, that we'll call A, has another node B flagged as PFAIL.
Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster.
The majority of masters signaled the PFAIL or FAIL condition within NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT time. (The validity factor is set to 2 in the current implementation, so this is just two times the NODE_TIMEOUT time).
FAIL => Normal: FAIL flag can only be cleared in the following situations:
The node is already reachable and is a slave. In this case the FAIL flag can be cleared as slaves are not failed over.
The node is already reachable and is a master not serving any slot. In this case the FAIL flag can be cleared as masters without slots do not really participate in the cluster and are waiting to be configured in order to join the cluster.
The node is already reachable and is a master, but a long time (N times the NODE_TIMEOUT) has elapsed without any detectable slave promotion. It's better for it to rejoin the cluster and continue in this case.
Weak agreement
PFAIL => FAIL is a week agreement because:
Nodes collect views of other nodes over some time period, so even if the majority of master nodes need to agree, actually this is just state that we collected from different nodes at different times and we are not sure, nor we require, that at a given moment the majority of masters agreed.
While every node detecting the FAIL condition will force that condition on other nodes in the cluster using the FAIL message, there is no way to ensure the message will reach all the nodes. For instance a node may detect the FAIL condition and because of a partition will not be able to reach any other node.
PFAIL => FAIL is an eventually consistency agreement because:
Eventually all the nodes should agree about the state of a given node. There are two cases that can originate from split brain conditions. Either some minority of nodes believe the node is in FAIL state, or a minority of nodes believe the node is not in FAIL state. In both the cases eventually the cluster will have a single view of the state of a given node:
Case 1: If a majority of masters have flagged a node as FAIL, because of failure detection and the chain effect it generates, every other node will eventually flag the master as FAIL, since in the specified window of time enough failures will be reported.
Case 2: When only a minority of masters have flagged a node as FAIL, the slave promotion will not happen (as it uses a more formal algorithm that makes sure everybody knows about the promotion eventually) and every node will clear the FAIL state as per the FAIL state clearing rules above (i.e. no promotion after N times the NODE_TIMEOUT has elapsed).
FAIL propogation
The FAIL message will force every receiving node to mark the node in FAIL state, whether or not it already flagged the node in PFAIL state.
Step1: Slave election and promotion
Condition to start the election
The slave's master is in FAIL state. As soon as a master is in FAIL state, a slave waits a short period of time before trying to get elected. That delay is computed as follows:
DELAY = 500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds.
The fixed delay ensures that we wait for the FAIL state to propagate across the cluster, otherwise the slave may try to get elected while the masters are still unaware of the FAIL state, refusing to grant their vote.
The random delay is used to desynchronize slaves so they're unlikely to start an election at the same time.
The SLAVE_RANK is the rank of this slave regarding the amount of replication data it has processed from the master. Slaves exchange messages when the master is failing in order to establish a (best effort) rank: the slave with the most updated replication offset is at rank 0, the second most updated at rank 1, and so forth. In this way the most updated slaves try to get elected before others.
The master was serving a non-zero number of slots.
The slave replication link was disconnected from the master for no longer than a given amount of time, in order to ensure the promoted slave's data is reasonably fresh. This time is user configurable.
A slave increments its currentEpoch counter, and requests votes from master instances.
Votes are requested by the slave by broadcasting a FAILOVER_AUTH_REQUEST packet to every master node of the cluster. Then it waits for a maximum time of two times the NODE_TIMEOUT for replies to arrive (but always for at least 2 seconds).
Once the slave receives ACKs from the majority of masters, it wins the election. Otherwise if the majority is not reached within the period of two times NODE_TIMEOUT (but always at least 2 seconds), the election is aborted and a new one will be tried again after NODE_TIMEOUT * 4 (and always at least 4 seconds).
A master grant the vote if the following conditions are met
A master only votes a single time for a given epoch, and refuses to vote for older epochs: every master has a lastVoteEpoch field and will refuse to vote again as long as the currentEpoch in the auth request packet is not greater than the lastVoteEpoch. When a master replies positively to a vote request, the lastVoteEpoch is updated accordingly, and safely stored on disk.
A master votes for a slave only if the slave's master is flagged as FAIL.
Auth requests with a currentEpoch that is less than the master currentEpoch are ignored. Because of this the master reply will always have the same currentEpoch as the auth request. If the same slave asks again to be voted, incrementing the currentEpoch, it is guaranteed that an old delayed reply from the master can not be accepted for the new vote.
Once a master has voted for a given slave, replying positively with a FAILOVER_AUTH_ACK, it can no longer vote for another slave of the same master for a period of NODE_TIMEOUT * 2. In this period it will not be able to reply to other authorization requests for the same master.
A slave discards any AUTH_ACK replies with an epoch that is less than the currentEpoch at the time the vote request was sent. This ensures it doesn't count votes intended for a previous election.
Once a slave wins the election, it obtains a new unique and incremental configEpoch which is higher than that of any other existing master. It starts advertising itself as master in ping and pong packets, providing the set of served slots with a configEpoch that will win over the past ones.
Step2: Configuration epoch
Cluster current epoch
currentEpoch lifetime
At node creation every Redis Cluster node, both slaves and master nodes, set the currentEpoch to 0.
Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus messages header) is greater than the local node epoch, the currentEpoch is updated to the sender epoch.
Because of these semantics, eventually all the nodes will agree to the greatest configEpoch in the cluster.
currentEpoch use case
Currently this happens only during slave promotion, as described in the next section. Basically the epoch is a logical clock for the cluster and dictates that given information wins over one with a smaller epoch.
Configuration epoch
configEpoch lifetime
The configEpoch is set to zero in masters when a new node is created.
A new configEpoch is created during slave election. Slaves trying to replace failing masters increment their epoch and try to get authorization from a majority of masters. When a slave is authorized, a new unique configEpoch is created and the slave turns into a master using the new configEpoch.
configEpoch use case
configEpoch helps to resolve conflicts when different nodes claim divergent configurations (a condition that may happen because of network partitions and node failures).
Last updated