Resilience
Idempotent
Def
The PUT method is idempotent. An idempotent method means that the result of a successfully performed request is independent of the number of times it is executed.
Implementation
Idempotency can be implemented at different layers of the service architecture.
For example, an idempotency key combined with a distributed lock in the business logic layer
For example, database uniqueness constraints in the database layer
REST API layer
POST is NOT idempotent.
GET, PUT, DELETE, HEAD, OPTIONS and TRACE are idempotent.
https://restfulapi.net/idempotent-rest-apis/
Idempotent CREATE in DB layer
Ref: https://brandur.org/http-transactions
Example: insert a user row (uid, email) where uid is the primary key; a sketch follows this list
POST /users?email=jane@example.com
Unique constraint on email
Serializable transaction
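A minimal sketch of the pattern above, assuming Postgres accessed through psycopg2 and a `users (uid SERIAL PRIMARY KEY, email TEXT UNIQUE)` table (these specifics are illustrative, not from the source): the unique constraint turns a replayed insert into a no-op, and looking the row up by email returns the same uid either way.

```python
# Sketch only: assumes Postgres, psycopg2, and a table
#   users (uid SERIAL PRIMARY KEY, email TEXT UNIQUE NOT NULL)
import psycopg2
from psycopg2 import extensions

def create_user(dsn: str, email: str) -> int:
    """Idempotent create: calling this twice with the same email returns the same uid."""
    conn = psycopg2.connect(dsn)
    conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_SERIALIZABLE)
    try:
        with conn, conn.cursor() as cur:  # `with conn` commits on success, rolls back on error
            # The UNIQUE constraint on email makes a replayed INSERT a no-op.
            cur.execute(
                "INSERT INTO users (email) VALUES (%s) ON CONFLICT (email) DO NOTHING",
                (email,),
            )
            cur.execute("SELECT uid FROM users WHERE email = %s", (email,))
            return cur.fetchone()[0]
    finally:
        conn.close()
```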
Idempotent background job
Example: the application starts a transaction, executes a few DB operations, and queues a job somewhere in the middle, as sketched below:
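A sketch of that shape, using a hypothetical `db` handle and a hypothetical `queue_job()` standing in for whatever job queue is in use (none of these names come from the source):

```python
# Hypothetical helpers: `db.transaction()` opens a DB transaction,
# `queue_job()` enqueues work on some external job queue.
def complete_checkout(db, order_id):
    with db.transaction():  # BEGIN
        db.execute("UPDATE orders SET state = 'paid' WHERE id = %s", (order_id,))
        # The job is queued in the middle of the still-open transaction.
        queue_job("send_receipt", order_id=order_id)
        db.execute(
            "INSERT INTO audit_log (order_id, action) VALUES (%s, 'paid')",
            (order_id,),
        )
    # COMMIT happens here, but a fast worker may have picked up the job already.
```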
Why put job queuing inside the transaction?
If you queue a job after a transaction is committed, you run the risk of your program crashing after the commit, but before the job makes it to the queue. Data is persisted, but the background work doesn’t get done.
Failure cases
Case 1: If your queue is fast, the job enqueued by queue_job() is likely to fail. A worker starts running it before its enclosing transaction is committed, and it fails to access data that it expected to be available.
Case 2: A related problem is transaction rollbacks. In these cases data is discarded completely, and jobs inserted into the queue will never succeed no matter how many times they're retried.
Transactionally-staged job drain
Ref: https://brandur.org/job-drain
A way around this is to create a job staging table in our database. Instead of sending jobs to the queue directly, they're sent to the staging table first, and an enqueuer pulls them out in batches and puts them onto the job queue.
The enqueuer selects jobs, enqueues them, and then removes them from the staging table. Because jobs are inserted into the staging table from within a transaction, its isolation property (ACID’s “I”) guarantees that they’re not visible to any other transaction until after the inserting transaction commits. A staged job that’s rolled back is never seen by the enqueuer, and doesn’t make it to the job queue.
The enqueuer is also totally resistant to job loss. Jobs are only removed after they're successfully transmitted to the queue, so even if the enqueuer dies partway through, it will pick back up again and send along any jobs that it missed. At-least-once delivery semantics are guaranteed.
A rough implementation
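A rough sketch, assuming Postgres via psycopg2 for the staging table and a placeholder `publish_to_queue()` standing in for the real queue client (the names and schema below are assumptions):

```python
import time
import psycopg2
from psycopg2.extras import Json

# Assumed staging table:
#   CREATE TABLE staged_jobs (
#       id       BIGSERIAL PRIMARY KEY,
#       job_name TEXT NOT NULL,
#       job_args JSONB NOT NULL
#   );

def stage_job(cur, job_name, **args):
    """Called from inside the application's own transaction instead of enqueueing directly."""
    cur.execute(
        "INSERT INTO staged_jobs (job_name, job_args) VALUES (%s, %s)",
        (job_name, Json(args)),
    )

def publish_to_queue(job_name, job_args):
    """Placeholder for the real queue client (Sidekiq, Celery, SQS, ...)."""
    raise NotImplementedError

def run_enqueuer(dsn, batch_size=100, poll_interval=1.0):
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:
            # Only committed rows are visible here, so jobs from rolled-back
            # transactions never reach the queue.
            cur.execute(
                "SELECT id, job_name, job_args FROM staged_jobs ORDER BY id LIMIT %s",
                (batch_size,),
            )
            rows = cur.fetchall()
            for _job_id, job_name, job_args in rows:
                publish_to_queue(job_name, job_args)
            # Delete only after successful transmission: if the enqueuer dies
            # before this commit, the jobs are re-sent (at-least-once delivery).
            if rows:
                cur.execute(
                    "DELETE FROM staged_jobs WHERE id = ANY(%s)",
                    ([r[0] for r in rows],),
                )
        if not rows:
            time.sleep(poll_interval)
```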
Advantages over in-database queues
https://brandur.org/job-drain#in-database-queues
Application layer with idempotency key
https://brandur.org/idempotency-keys
Def
An idempotency key is a unique value that’s generated by a client and sent to an API along with a request. The server stores the key to use for bookkeeping the status of that request on its end. If a request should fail partway through, the client retries with the same idempotency key value, and the server uses it to look up the request’s state and continue from where it left off.
For request-level idempotency, a random, unique key should be generated by the client in order to ensure idempotency at the entity-collection level. For example, if we wanted to allow multiple, different payments for a reservation booking (such as Pay Less Upfront), we just need to make sure the idempotency keys are different. A UUID is a good format to use for this.
Entity-level idempotency is far more stringent and restrictive than request-level idempotency. Say we want to ensure that a given $10 payment with ID 1234 would only be refunded $5 once, since we can technically make $5 refund requests twice. We would then want to use a deterministic idempotency key based on the entity model to ensure entity-level idempotency. An example format would be “payment-1234-refund”. Every refund request for a unique payment would consequently be idempotent at the entity-level (Payment 1234).
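As an illustration of the two granularities (the key formats below are just examples, echoing the ones described above):

```python
import uuid

# Request-level idempotency: a random, unique key per logical request,
# so two distinct payments for the same booking get two distinct keys.
request_level_key = str(uuid.uuid4())

# Entity-level idempotency: a deterministic key derived from the entity,
# so every refund attempt against payment 1234 collapses onto the same key.
def refund_idempotency_key(payment_id: int) -> str:
    return f"payment-{payment_id}-refund"

entity_level_key = refund_idempotency_key(1234)  # "payment-1234-refund"
```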
How to pass
A common way to transmit an idempotency key is through an HTTP header:
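For example, Stripe's API accepts an `Idempotency-Key` header (other APIs may use a different header name); a request with Python `requests` might look like this, with a placeholder test key:

```python
import uuid
import requests

resp = requests.post(
    "https://api.stripe.com/v1/charges",
    auth=("sk_test_...", ""),  # placeholder secret key; Stripe uses basic auth with an empty password
    data={"amount": 2000, "currency": "usd", "source": "tok_visa"},
    # Retrying with the same Idempotency-Key returns the result of the first attempt.
    headers={"Idempotency-Key": str(uuid.uuid4())},
)
```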
Retention policy
Keys are not meant to be used as a permanent request archive but rather as a mechanism for ensuring near-term correctness. Servers should recycle them out of the system beyond a horizon where they won’t be of much use – say 24 hours or so.
Idempotent DB schema design
idempotency_key is unique per user, so the same idempotency key value can be reused across requests as long as they come from different user accounts.
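A possible schema, loosely following the referenced post; the exact columns here are illustrative assumptions rather than a prescribed design:

```python
# DDL for an idempotency_keys table (Postgres flavor), kept as a string so it can
# be executed from application code or a migration tool.
IDEMPOTENCY_KEYS_DDL = """
CREATE TABLE idempotency_keys (
    id              BIGSERIAL    PRIMARY KEY,
    user_id         BIGINT       NOT NULL,
    idempotency_key TEXT         NOT NULL,
    request_params  JSONB        NOT NULL,           -- to detect reuse of a key with a different body
    recovery_point  TEXT         NOT NULL DEFAULT 'started',
    response_code   INT,                             -- filled in once the request completes
    response_body   JSONB,
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    -- Unique per user: different users may send the same key value without colliding.
    CONSTRAINT idempotency_keys_user_key UNIQUE (user_id, idempotency_key)
);
"""
```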
Example with rocket ride
Transaction 1: Insert idempotency key
Transaction 2: Create ride and audit record
Transaction 3: Call Stripe (a rough sketch of this three-transaction flow follows the list)
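A condensed sketch of that flow; the `db.*` helpers, table names, and recovery-point names are hypothetical, while `stripe.Charge.create(..., idempotency_key=...)` is the real stripe-python call (the amount is illustrative):

```python
import stripe

def create_ride(db, user, key_value, ride_params):
    # Transaction 1: upsert the idempotency key and load any previously recorded progress.
    with db.transaction():
        key = db.upsert_idempotency_key(user.id, key_value, ride_params)

    # Transaction 2: create the ride and an audit record, then advance the recovery point.
    if key.recovery_point == "started":
        with db.transaction():
            ride = db.insert_ride(user.id, ride_params, idempotency_key_id=key.id)
            db.insert_audit_record(user.id, action="created_ride", ride_id=ride.id)
            db.set_recovery_point(key.id, "ride_created")
            key.recovery_point = "ride_created"

    # Transaction 3: call Stripe, passing the same idempotency key down to the PSP,
    # then record the result and mark the key finished.
    if key.recovery_point == "ride_created":
        charge = stripe.Charge.create(
            amount=2000,
            currency="usd",
            customer=user.stripe_customer_id,
            idempotency_key=key_value,
        )
        with db.transaction():
            db.mark_ride_charged(key.id, charge.id)
            db.set_recovery_point(key.id, "finished")
```

On a retry with the same idempotency key, the recorded recovery point lets the flow resume from whichever phase had not yet completed.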
Business layer with distributed lock
Distributed lock
Scenario: accept a given request only once within a short time window, e.g. a user accidentally clicks the order button twice (a sketch follows below).
Please see Distributed lock
https://www.alibabacloud.com/blog/four-major-technologies-behind-the-microservices-architecture_596216
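A minimal sketch of that scenario using Redis's `SET ... NX EX` as the lock primitive via redis-py (the key format and TTL are illustrative choices):

```python
import redis

r = redis.Redis()

def place_order_once(user_id: str, order_token: str, ttl_seconds: int = 10) -> bool:
    """Return True if this request won the lock and should proceed; False if it's a duplicate."""
    # SET key value NX EX ttl: succeeds only if the key does not already exist.
    acquired = r.set(f"order-lock:{user_id}:{order_token}", "1", nx=True, ex=ttl_seconds)
    if not acquired:
        return False  # another request within the window already holds the lock
    # ... create the order here ...
    return True
```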
Idempotency with external systems
While each integration with PSPs and banks is different, we can distinguish two integration styles
API-based integration with modern PSPs
API-based integrations with modern PSPs: REST-based APIs exchanging data in JSON, one transaction at a time, in near-real time
Problem
Payments operations use several PSPs in a complex arrangement, and another PSP may be used if a payment fails with the originally selected one. Such practice may improve collection rate, but naively retrying a failed operation on another PSP may lead to double charging.
Solution
The approach Uber uses to avoid this problem is a dedicated request storage that is consulted when a retry needs to be performed, to ensure that the retry goes back to the original service (Figure 12 of the source article).
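A toy sketch of the idea (the class, method names, and in-memory storage are assumptions for illustration, not Uber's implementation; a real system would use durable storage):

```python
from typing import Dict, Optional

class PaymentRequestStore:
    """Remembers which PSP a given payment attempt was first routed to."""
    def __init__(self) -> None:
        self._routes: Dict[str, str] = {}  # payment_attempt_id -> psp_name; durable in practice

    def record_route(self, payment_attempt_id: str, psp_name: str) -> None:
        self._routes.setdefault(payment_attempt_id, psp_name)

    def psp_for_retry(self, payment_attempt_id: str) -> Optional[str]:
        return self._routes.get(payment_attempt_id)

def charge_with_retry(store, payment_attempt_id, preferred_psp, psp_clients):
    # On a retry, consult the request storage and go back to the PSP that first
    # saw this attempt, instead of failing over and risking a double charge.
    psp_name = store.psp_for_retry(payment_attempt_id) or preferred_psp
    store.record_route(payment_attempt_id, psp_name)
    return psp_clients[psp_name].charge(payment_attempt_id)
```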
Legacy batch integration with banks
Integrations are done by exchanging files via SFTP, with relatively low frequency (daily or every few hours).
Retry
Simple client retries
Exponential backoff with jitter (sketched below)
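A common "full jitter" variant, sketched below; the bounds and attempt count are illustrative defaults:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `operation`, sleeping a random amount under an exponentially growing cap."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount between 0 and the exponential cap.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```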
Cons
Clogged batch processing. When we are required to process a large number of messages in real time, repeatedly failed messages can clog batch processing. The worst offenders consistently exceed the retry limit, which also means that they take the longest and use the most resources. Without a success response, the Kafka consumer will not commit a new offset, and the batches containing these bad messages are blocked as they are re-consumed again and again, as illustrated in the figure below.
Difficulty retrieving metadata. It can be cumbersome to obtain metadata about the retries, such as timestamps and the retry count.
Retry queues to the rescue
Multi-layer retry queue
A separate group of retry consumers will read off their corresponding retry queue. These consumers behave like those in the original architecture, except that they consume from a different Kafka topic. Meanwhile, executing multiple retries is accomplished by creating multiple topics, with a different set of listeners subscribed to each retry topic. When the handler of a particular topic returns an error response for a given message, it will publish that message to the next retry topic below it.
Delays before retry: each subsequent level of retry consumers can enforce a processing delay, in other words, a timeout that increases as a message steps down through each retry topic. This mechanism follows a leaky bucket pattern where flow rate is expressed by the blocking nature of the delayed message consumption within the retry queues. Consequently, our queues are not so much retry queues as they are delayed processing queues, where the re-execution of error cases is our best-effort delivery: handler invocation will occur at least after the configured timeout but possibly later.
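A sketch of one retry tier using kafka-python; the topic names, delay values, and handler wiring are assumptions for illustration, not the exact architecture from the source:

```python
import json
import time
from kafka import KafkaConsumer, KafkaProducer

RETRY_TOPICS = ["orders", "orders-retry-1", "orders-retry-2"]  # last tier falls through to the DLQ
DLQ_TOPIC = "orders-dlq"
DELAYS = {"orders": 0, "orders-retry-1": 60, "orders-retry-2": 300}  # seconds per tier

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def run_tier(topic: str, handler):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        group_id=f"{topic}-consumers",
        enable_auto_commit=False,
    )
    for msg in consumer:
        payload = json.loads(msg.value)
        # Delayed processing: block until the message is old enough for this tier
        # (this blocking consumption is what gives the leaky-bucket behavior).
        wait = DELAYS[topic] - (time.time() - payload.get("enqueued_at", time.time()))
        if wait > 0:
            time.sleep(wait)
        try:
            handler(payload)
        except Exception:
            # On failure, push the message to the next retry topic, or to the DLQ at the last tier.
            idx = RETRY_TOPICS.index(topic)
            next_topic = RETRY_TOPICS[idx + 1] if idx + 1 < len(RETRY_TOPICS) else DLQ_TOPIC
            payload["enqueued_at"] = time.time()
            producer.send(next_topic, json.dumps(payload).encode())
        # Always advance the offset so a bad message never clogs the current topic.
        consumer.commit()
```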
Deadletter queue
If requests continue to fail retry after retry, we want to collect these failures in a DLQ for visibility and diagnosis. A DLQ should allow listing for viewing the contents of the queue, purging for clearing those contents, and merging for reprocessing the dead-lettered messages, allowing comprehensive resolution for all failures affected by a shared issue.
References
Stripe API Idempotency: https://stripe.com/blog/idempotency