# Scalability

* [Get timeline optimization](#get-timeline-optimization)
* [Create tweet optimization](#create-tweet-optimization)
* [Scale pull](#scale-pull)
* [Scale push](#scale-push)
* [Cache algorithm](#cache-algorithm)
* [Hot spot / Thundering herd problem](#hot-spot--thundering-herd-problem)
* [References](#references)

## Get timeline optimization

![](https://1010073591-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Mk8dv8Mfudl_6ziUzDf%2Fuploads%2Fgit-blob-1d19a4f278220b4e67ce23954e114b08dd189469%2Ftwitter_getPathOptimization.png?alt=media)

## Create tweet optimization

* Write requests go straight to the load balancer, bypassing the CDN and reverse proxy.
* Database writes go through a message queue so that sudden spikes in tweet/newsfeed creation do not overwhelm the database.
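The write path above can be sketched with an in-process queue standing in for a real broker (Kafka, RabbitMQ, etc.) and a list standing in for the tweets table; the names `enqueue_tweet` and `db_writer` are hypothetical, not from the source:

```python
import queue
import threading

write_queue = queue.Queue()   # stand-in for a real message broker
db = []                       # stand-in for the tweets table

def enqueue_tweet(user_id, text):
    """Web tier: accept the write and return immediately."""
    write_queue.put({"user_id": user_id, "text": text})

def db_writer():
    """Worker: drain the queue at a rate the database can sustain."""
    while True:
        tweet = write_queue.get()
        if tweet is None:          # sentinel to stop the worker
            break
        db.append(tweet)           # the only place that touches the DB
        write_queue.task_done()

worker = threading.Thread(target=db_writer)
worker.start()
enqueue_tweet(1, "hello")
enqueue_tweet(2, "world")
write_queue.put(None)
worker.join()
```

The queue absorbs request spikes: the web tier acknowledges writes instantly while the worker applies them to the database at a steady pace.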

![](https://1010073591-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-Mk8dv8Mfudl_6ziUzDf%2Fuploads%2Fgit-blob-d7897c7d1dbe2ac310983fe0f44624c3ed89820f%2Ftwitter_writePathOptimization.png?alt=media)

## Scale pull

* Add a cache in front of the DB; a cache hit can be on the order of 1000 times faster than a DB query
* What to cache
  * Cache each user's timeline
    * N DB queries become N cache requests
    * Trade-off: cache the whole timeline, or only the latest 1,000 tweets?
  * Cache each user's news feed
    * For users without a cached news feed: merge the 100 latest tweets of each of the N followees, sort, and keep the latest 100.
    * For users with a cached news feed: fetch each followee's tweets after a given timestamp, then merge them into the cached feed.
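The merge step above is a k-way merge over already-sorted per-user timelines. A minimal sketch, assuming each cached timeline is stored newest-first as `(timestamp, user_id, text)` tuples (the sample data is hypothetical):

```python
import heapq

# Hypothetical per-user timeline caches: user_id -> tweets, newest first.
timelines = {
    "alice": [(105, "alice", "a2"), (101, "alice", "a1")],
    "bob":   [(104, "bob", "b2"), (102, "bob", "b1")],
    "carol": [(103, "carol", "c1")],
}

def build_newsfeed(followees, limit=100):
    """Merge the cached timelines of all followees, newest first."""
    merged = heapq.merge(
        *(timelines.get(u, []) for u in followees),
        key=lambda t: t[0],   # sort by timestamp
        reverse=True,         # inputs are newest-first, so merge descending
    )
    return list(merged)[:limit]

feed = build_newsfeed(["alice", "bob", "carol"], limit=3)
```

`heapq.merge` never materializes all N timelines at once, which matters when a user follows many accounts.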

## Scale push

* The push-based approach stores each user's news feed on disk ahead of time, which performs much better than even the optimized pull approach.
* Do not push to inactive users
  * Rank followers by weight (for example, last login time)
* Problem case: number of followers far exceeds number of followees
  * Lady Gaga has 62.5M followers on Twitter; Justin Bieber has 77.6M on Instagram. The asynchronous fan-out task may take hours to finish.
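The "skip inactive users" rule can be sketched as a fan-out that only pushes to followers seen within an activity window; the 30-day window and all names here are illustrative assumptions, not from the source:

```python
import time

ACTIVE_WINDOW = 30 * 24 * 3600   # hypothetical: active = login within 30 days

NOW = time.time()
followers = {                    # follower_id -> last login (unix seconds)
    "f1": NOW,                          # active
    "f2": NOW - 90 * 24 * 3600,         # inactive: skip on fan-out
}
feeds = {}                       # follower_id -> list of tweet ids, newest first

def fan_out(tweet_id, followers, now=NOW):
    """Push a new tweet only to followers seen within the active window."""
    for follower_id, last_login in followers.items():
        if now - last_login <= ACTIVE_WINDOW:
            feeds.setdefault(follower_id, []).insert(0, tweet_id)
        # inactive followers rebuild their feed via pull on next login

fan_out("t42", followers)
```

Skipped followers pay the cost lazily with a pull when they return, which keeps the fan-out write volume bounded by the active audience.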

## Cache algorithm

* LRU is not a good fit here: in the pull-based model, building a feed fetches tweets for all followees from different servers, and tweets from less popular users would have a high cache-miss rate under LRU.
* Instead, an expiration-based policy is used: only tweets within a certain time window are cached, for example the last 7 days.
  * The user ID is the cache key and the user's tweet content is the cache value.
  * By estimation, 7 days of tweet content is about 700 GB.
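An expiration-based cache can be sketched as a map of user ID to a timestamped entry, treating anything older than the TTL as a miss; the class and its API are hypothetical, and the clock is injectable only to make the sketch testable:

```python
import time

CACHE_TTL = 7 * 24 * 3600   # cache tweets for 7 days, as in the estimate above

class ExpiringTweetCache:
    """Keyed by user ID; entries older than the TTL count as misses."""

    def __init__(self, ttl=CACHE_TTL, clock=time.time):
        self.ttl = ttl
        self.clock = clock          # injectable for testing
        self.store = {}             # user_id -> (stored_at, tweets)

    def put(self, user_id, tweets):
        self.store[user_id] = (self.clock(), tweets)

    def get(self, user_id):
        entry = self.store.get(user_id)
        if entry is None:
            return None
        stored_at, tweets = entry
        if self.clock() - stored_at > self.ttl:
            del self.store[user_id]   # expired: evict and report a miss
            return None
        return tweets
```

In production this maps to a plain TTL on the Redis key (e.g. `EXPIRE`), so eviction is driven by age rather than by recency of access as in LRU.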

## Hot spot / Thundering herd problem

* When a superstar posts a tweet that is cached only in Redis, every read hits the same Redis key, creating a hot key problem.
* Superstars' tweets can additionally be cached locally on the web servers.
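The two-tier lookup can be sketched as: check a per-server local cache first, then the shared Redis tier, replicating flagged hot keys locally. Everything here (the dicts standing in for Redis, the `HOT_KEYS` set, `get_tweet`) is a hypothetical stand-in, not a real client:

```python
local_cache = {}                                # per web server, absorbs hot keys
shared_cache = {"tweet:42": "superstar tweet"}  # stand-in for Redis

HOT_KEYS = {"tweet:42"}   # assume a detector has flagged this key as hot

def get_tweet(key):
    if key in local_cache:             # served without touching Redis
        return local_cache[key]
    value = shared_cache.get(key)      # one shared-tier lookup
    if key in HOT_KEYS and value is not None:
        local_cache[key] = value       # replicate the hot value locally
    return value
```

After the first miss fills the local tier, subsequent reads of the hot key never reach Redis, spreading the load across all web servers. A real deployment also needs invalidation or a short local TTL so stale values expire.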

## References

* Further reading: Facebook's lease mechanism for stale sets and thundering herds, from "Scaling Memcache at Facebook"
  * <https://dev.to/lazypro/handling-stale-sets-and-thundering-herds-of-cache-170c>
* JD.com's hot key detection framework: design and practice
  * <https://mp.weixin.qq.com/s/xOzEj5HtCeh_ezHDPHw6Jw>
  * <https://gitee.com/jd-platform-opensource/hotkey>

