High availability
Fault tolerant heartbeat messages
Reactive heartbeat messages
Proactive: Registry center proactively calls service providers.
Cons: The registry center has to poll every service provider on a schedule, so a failure is only noticed on the next pass and detection is delayed.
Reactive: Service providers report heartbeat messages to the service registry.
This is better suited to scenarios with many service providers.
Report frequency
Usually the health ping interval is set to 30s. This avoids putting too much pressure on the server while keeping the delay in detecting a change in a node's health state acceptable.
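A minimal sketch of the registry side of the reactive model, assuming providers report every 30s as described above. The class name `HeartbeatRegistry`, the injectable `clock`, and the 90s TTL (three missed reports) are illustrative assumptions, not values taken from any particular registry.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Tracks the last heartbeat reported by each provider node (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Assumption: a node is considered gone after three missed 30s reports.
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def report(self, node_id: str) -> None:
        """Called when a provider pushes a heartbeat to the registry."""
        self._last_seen[node_id] = self._clock()

    def alive_nodes(self) -> List[str]:
        """Nodes whose most recent heartbeat is still within the TTL."""
        now = self._clock()
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen <= self._ttl)
```

Because the registry only records timestamps on `report` and compares them lazily in `alive_nodes`, it never has to loop over providers to call them, which is the main advantage of the reactive model.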
Subhealth criteria
A node transitions among three states: dead, healthy and subhealthy. An interesting question is how to decide the threshold at which a node moves from healthy to subhealthy.
Health information needs to be gathered at both the application layer and the service layer. At the application layer, the RPS and response time of each API differ, so a single threshold on total failures or TPS does not work. Use the success ratio (successes / total requests) as the criterion instead.
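The success-ratio criterion can be sketched as follows. The names `NodeState`, `WindowStats` and `classify`, the 0.95 threshold, and the rule that a node with no traffic counts as healthy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    SUBHEALTHY = "subhealthy"
    DEAD = "dead"


@dataclass
class WindowStats:
    """Counters collected for one node over one observation window."""
    successes: int
    total: int
    heartbeat_ok: bool


def classify(stats: WindowStats, min_success_ratio: float = 0.95) -> NodeState:
    """Classify a node from its heartbeat and its success ratio."""
    if not stats.heartbeat_ok:
        return NodeState.DEAD
    if stats.total == 0:
        # Assumption: no traffic means no evidence of trouble.
        return NodeState.HEALTHY
    ratio = stats.successes / stats.total
    if ratio >= min_success_ratio:
        return NodeState.HEALTHY
    return NodeState.SUBHEALTHY
```

Using a ratio rather than an absolute failure count lets the same threshold apply to a busy API and a quiet one.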
Resilient to network latency
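Deploy detectors across different locations, so that latency on a single network path is not mistaken for a node failure.
But set a threshold (for example 40%) on the fraction of nodes that may be removed, to avoid removing all nodes because of network problems.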
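One possible reading of the threshold above is a guard that refuses to remove anything when too large a share of nodes looks unhealthy at once, on the theory that the detector's network is the more likely culprit. The function name `nodes_to_remove` and this all-or-nothing policy are assumptions; a registry could equally cap removals at the threshold instead.

```python
from typing import List, Set


def nodes_to_remove(all_nodes: List[str], suspected: Set[str],
                    max_removal_ratio: float = 0.4) -> List[str]:
    """Return the nodes that are safe to remove from the registry."""
    if not all_nodes:
        return []
    candidates = [node for node in all_nodes if node in suspected]
    if len(candidates) / len(all_nodes) > max_removal_ratio:
        # Too many nodes failed together: assume a network problem
        # and keep every node registered.
        return []
    return candidates
```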
Event subscription
RPC client subscribes to certain services
Take Zookeeper as an example: clients use its watch mechanism to be notified of changes.
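The subscription pattern itself can be illustrated with an in-memory sketch. This is not Zookeeper's API; `ServiceDirectory`, `subscribe` and `update` are hypothetical names, and the choice to deliver an initial snapshot on subscription is an assumption.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A listener receives the service name and its current instance list.
Listener = Callable[[str, List[str]], None]


class ServiceDirectory:
    """In-memory stand-in for a registry that pushes changes to subscribers."""

    def __init__(self) -> None:
        self._instances: DefaultDict[str, List[str]] = defaultdict(list)
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, service: str, listener: Listener) -> None:
        """Register interest in one service and deliver the current snapshot."""
        self._listeners[service].append(listener)
        listener(service, list(self._instances[service]))

    def update(self, service: str, instances: List[str]) -> None:
        """Replace the instance list and notify only that service's subscribers."""
        self._instances[service] = list(instances)
        for listener in list(self._listeners[service]):
            listener(service, list(instances))
```

An RPC client would subscribe once per service it calls and refresh its local address list inside the listener.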
How to avoid notification storm
Problem: Suppose a service provider has 100 nodes and each node has 100 consumers. Then when there is an update in the service provider, there will be 100*100 notifications generated.
Solution:
Capacity planning for registry center.
Scale up registry center.
Only transmit incremental information.
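Transmitting only incremental information can be sketched as computing a delta between two versions of a service's address list and applying it on the consumer side. The function names and the `added`/`removed` payload shape are assumptions for illustration.

```python
from typing import Dict, List, Set


def compute_delta(old: Set[str], new: Set[str]) -> Dict[str, List[str]]:
    """Registry side: describe the change instead of resending the full list."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def apply_delta(current: Set[str], delta: Dict[str, List[str]]) -> Set[str]:
    """Consumer side: patch the locally held address list with the delta."""
    return (current | set(delta["added"])) - set(delta["removed"])
```

When one node out of 100 changes, each notification then carries one address rather than the whole list, which reduces the payload size even though the number of notifications stays the same.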
Fault tolerance on servers
Fault tolerance on clients
Fault tolerance on service registry
Choose among service registry frameworks. Choose an AP-oriented service registry if needed.
Typically, use the AP model on large clusters and the CP model on small clusters. Here AP and CP mean:
AP: The service registry stays available, but you may get a stale or wrong service entry.
CP: The service registry may become unavailable, and then you get nothing.
| Feature | Zookeeper | etcd | Eureka | Consul |
| --- | --- | --- | --- | --- |
| CAP model | CP | CP | AP | CP |
| Consensus protocol | ZAB (Paxos-like) | Raft | Not applicable | Raft |
| Integration mechanism | SDK client | HTTP/gRPC | HTTP | HTTP/DNS |
| Watch support | Supported | Long polling | Long polling | Long polling |
| KV storage | Supported | Supported | Not supported | Supported |
| Implementation language | Java | Go | Java | Go |
What if service registry goes down in the worst case
Start backup service registry
Manually start the node
Integration mechanism
In-app solutions are typically suitable when both service providers and consumers belong to the same technology stack. An example is Eureka.
Out-of-app solutions are typically suitable for cloud (containerized) applications. An example is Consul.