Typical topics
Intuition
Locality of reference
Long tail
Factors for hit ratio
Size of cache key space
The more unique cache keys your application generates, the less chance you have to reuse any one of them. Always consider ways to reduce the number of possible cache keys.
The number of items you can store in cache
The more objects you can physically fit into your cache, the better your cache hit ratio.
Longevity
How long each object can be stored in cache before expiring or being invalidated.
Applicable scenarios
short answer
How many times a cached piece of data can and is reused by the application
the proportion of response time that is alleviated by caching
In applications that are I/O bound, most of the response time is getting data from a database.
Selection criteria
Performance: Cache should be able to constantly sustain the throughput requirements in terms of read or write queries from application. So the more it is able to take advantage of resources like — RAM, SSD or Flash, CPU etc, the better it is at producing output.
Scalability: Caching system has to be able to maintain steady performance even if number of operations, requests, users & amount of data flow increases. It must be able to scale linearly without any adverse impact. So elastically growing up or down is an important characteristic.
Availability: High availability is the utmost requirement in today’s systems. It’s fine to get stale data ( depending on use case ) but unavailable systems are not desired. Whether there is a planned or unplanned outage, or a portion of system is crashed or due to natural calamity some data centre is non-operational, cache has to be available all the time.
Manageability: Easy deployment, monitoring, useful dashboard, real-time matrices make every developer & SRE’s life simple.
Simplicity: All other things equal, simplicity is always better. Adding a cache to your deployment should not introduce unnecessary complexity or make more work for developers.
Affordability: Cost is always a consideration with any IT decision, both upfront implementation as well as ongoing costs. Your evaluation should consider total cost of ownership, including license fees as well as hardware, services, maintenance, and support.
Real world
Cache warming at Netflix: https://netflixtechblog.com/cache-warming-agility-for-a-stateful-service-2d3b1da82642
From RAM to SSD Data caching at Netflix: https://netflixtechblog.com/evolution-of-application-data-caching-from-ram-to-ssd-a33d6fa7a690
Netflix SSD data caching: https://netflixtechblog.com/application-data-caching-using-ssds-5bf25df851ef
Netflix global caching: https://netflixtechblog.com/caching-for-a-global-netflix-7bcc457012f1
Instagram thundering problem: https://instagram-engineering.com/thundering-herds-promises-82191c8af57d
Background data fetching: https://instagram-engineering.com/improving-performance-with-background-data-prefetching-b191acb39898
Scaling Memcached at Facebook:
Reduce latency
Problem: Items are distributed across the memcached servers through consistent hashing. Thus web servers have to rountinely communicate with many memcached servers to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern can cause incast congestion or allow a single server to become the bottleneck for many web servers.
Solution: Focus on the memcache client.
Reduce load
Problem: Use memcache to reduce the frequency of fetching data among more expensive paths such as database queries. Web servers fall back to these paths when the desired data is not cached.
Solution: Leases; Stale values;
Handling failures
Problem:
A small number of hosts are inaccessible due to a network or server failure.
A widespread outage that affects a significant percentage of the servers within the cluster.
Solution:
Small outages: Automated remediation system.
Gutter pool
In a region: Replication
Across regions: Consistency
TODO
Last updated