Architecture tradeoff analysis

Review Rubrics

Soft skills

  • Requirements gathering

  • Make decisions and tradeoffs with justification

  • Describe the solution using concise language and accurate technical terms

Hard skills

  • Design quality: scalability, reliability, efficiency, etc. (L4/L5)

  • Basic facts about existing software and hardware capabilities (L4 partly, L5)

  • Project lifecycle awareness, e.g. how a project is developed and maintained (L5)

Non-functional requirements (NFRs)

Type | Description
Performance | Efficiency, such as throughput and response time
Availability | Uptime percentage over a year
Scalability | Service capacity increases linearly as the number of nodes increases
Extensibility | Pluggable; easy to add new functionality
Security | Privacy and security
Observability | Able to detect problems and find the root cause quickly
Testability | Easy to test different components
Robustness | Fault tolerance and fast recovery; high robustness usually indicates high availability
Portability / Compatibility | Support for different OSes, hardware, software (browsers, etc.) and versions
Consistency | Whether all clients see the same data at the same time (e.g. strong vs. eventual consistency)

Availability

Availability percentage and service downtime
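
To make the link between an availability percentage and service downtime concrete, here is a minimal sketch. It assumes a 365-day year and treats any time the service is not available as downtime; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; leap years are ignored


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Downtime per year that is still within the given availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_seconds_per_year(pct) / 60
        print(f"{pct}% available -> {minutes:,.1f} minutes of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the step from 99.9% to 99.99% is usually much more expensive than the step from 99% to 99.9%.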

Commodity hardware failure trend

  • If your system has 4-5 services and around 10 database servers on the critical path, and each disk has an annual failure rate of 2%, then you should expect roughly two disk failures per year. This works out when the fleet holds on the order of 100 disks in total, since 100 * 2% = 2.
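
As a rough sketch of the arithmetic behind this kind of failure estimate, the snippet below assumes disks fail independently at a constant annual rate. The 100-disk fleet size is an illustrative assumption, not a figure from the text.

```python
def expected_failures_per_year(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of disk failures in one year."""
    return disk_count * annual_failure_rate


def probability_of_any_failure(disk_count: int, annual_failure_rate: float) -> float:
    """Chance that at least one disk fails in a year (independent failures assumed)."""
    return 1 - (1 - annual_failure_rate) ** disk_count


if __name__ == "__main__":
    for disks in (10, 100):  # hypothetical fleet sizes
        expected = expected_failures_per_year(disks, 0.02)
        any_failure = probability_of_any_failure(disks, 0.02)
        print(f"{disks} disks: {expected:.1f} expected failures, "
              f"{any_failure:.0%} chance of at least one")
```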

Decision chart

  • [TODO: Decision chart]

COGS

Commodity hardware

  • Two Intel Xeon E5-2623 v3’s (quad core) – $900 total

  • 128GB RAM (using 8GB DIMMs) – $1,920

  • Two 512GB SSDs for fast storage – $450

  • Six 4TB hard drives for slow storage – $900

  • Grand total: $5,070 (the four components listed above sum to $4,170; the remaining $900 is not itemized here)

Capacity planning

1. Get a baseline: MAU and DAU

  • Industry benchmarks report the average stickiness of products by industry. Stickiness is calculated as (DAU/MAU)*100. The median is reported along with the average because medians are less likely to be skewed by outliers.

  • For the SaaS industry, the average stickiness is 13%, which means slightly less than 4 days of activity per user per month. The median for the SaaS industry is 9.4%, implying less than 3 days of activity per user per month.

  • Multiply DAU/WAU by WAU/MAU to get the actual DAU/MAU ratio:

    • Facebook: ~72%

    • Ecommerce:

      • Amazon: 17%

      • Walmart: 15%

      • eBay: 3%

    • Finance:

      • Paypal: 12.5%

      • Venmo: 10%

    • Uber: 12.5%

    • Netflix: 3%

    • Groupon: 4.5%

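
The stickiness arithmetic above can be sketched as follows. The function names are illustrative, a 30-day month is assumed, and the 0.5 ratios in the last example are made-up inputs.

```python
def stickiness_percent(dau: float, mau: float) -> float:
    """Stickiness as defined above: (DAU / MAU) * 100."""
    return dau / mau * 100


def active_days_per_month(stickiness_pct: float, days_in_month: int = 30) -> float:
    """Average number of active days per user per month implied by a stickiness value."""
    return stickiness_pct / 100 * days_in_month


def dau_over_mau(dau_over_wau: float, wau_over_mau: float) -> float:
    """Chain DAU/WAU and WAU/MAU into a DAU/MAU ratio."""
    return dau_over_wau * wau_over_mau


if __name__ == "__main__":
    print(active_days_per_month(13))   # SaaS average: just under 4 days
    print(active_days_per_month(9.4))  # SaaS median: just under 3 days
    print(dau_over_mau(0.5, 0.5))      # hypothetical ratios
```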

2. Growth speed

  • For fast-growing data (e.g. order data on an e-commerce website), provision 2x the planned capacity to avoid resharding.

  • For slow-growing data (e.g. user identity data on an e-commerce website), provision for the 3-year estimated capacity to avoid resharding.

3. Divide capacity by system capability

Single Kafka instance

  • Single machine write: 250K messages (50 MB) per second

  • Single machine read: 550K messages (110 MB) per second
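
Dividing the required capacity by the per-machine capability gives a first estimate of the machine count. The sketch below uses the single-machine Kafka figures above. The 1 million messages per second workload and the 50% target utilization are illustrative assumptions, and replication overhead is ignored.

```python
import math

KAFKA_WRITE_MSGS_PER_SEC = 250_000  # single-machine write figure from the notes
KAFKA_READ_MSGS_PER_SEC = 550_000   # single-machine read figure from the notes


def machines_needed(peak_msgs_per_sec: float,
                    per_machine_msgs_per_sec: float,
                    target_utilization: float = 0.5) -> int:
    """Machines required so that each one stays at or below target_utilization."""
    usable_per_machine = per_machine_msgs_per_sec * target_utilization
    return math.ceil(peak_msgs_per_sec / usable_per_machine)


if __name__ == "__main__":
    peak = 1_000_000  # hypothetical peak write load, messages per second
    print("for writes:", machines_needed(peak, KAFKA_WRITE_MSGS_PER_SEC))
    print("for reads: ", machines_needed(peak, KAFKA_READ_MSGS_PER_SEC))
```

The larger of the two results is the one that drives the plan.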

Appendix: Conversions

Power of two

Power of two | Approximate base-10 value | Short name
2^10 | 1 thousand (10^3) | 1 KB
2^20 | 1 million (10^6) | 1 MB
2^30 | 1 billion (10^9) | 1 GB
2^40 | 1 trillion (10^12) | 1 TB
2^50 | 1 quadrillion (10^15) | 1 PB
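
The base-10 values in this table are approximations. As a quick check, this small sketch (with an illustrative function name) prints how far each power of two is from its rounded decimal value.

```python
def approximation_error_percent(exponent: int) -> float:
    """How much 2**exponent exceeds 10**(3 * exponent / 10), in percent."""
    exact = 2 ** exponent
    approx = 10 ** (exponent // 10 * 3)
    return (exact - approx) / approx * 100


if __name__ == "__main__":
    for exponent, unit in [(10, "KB"), (20, "MB"), (30, "GB"), (40, "TB"), (50, "PB")]:
        error = approximation_error_percent(exponent)
        print(f"2^{exponent} (1 {unit}) is {error:.1f}% larger than 10^{exponent // 10 * 3}")
```

The gap grows from about 2.4% at the KB level to over 12% at the PB level, which is fine for back-of-the-envelope work but worth remembering for large storage estimates.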

Time scale conversion

  • Total seconds in a day: 86400 ~ 10^5

  • 2.5 million requests per month: 1 request per second

  • 100 million requests per month: 40 requests per second

  • 1 billion requests per month: 400 requests per second
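
The monthly-to-per-second conversions above can be reproduced with a short helper. This sketch assumes a 30-day month; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 3600  # 86,400, roughly 10^5


def monthly_to_rps(requests_per_month: float, days_per_month: int = 30) -> float:
    """Average requests per second for a given monthly request volume."""
    return requests_per_month / (days_per_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    for monthly in (2_500_000, 100_000_000, 1_000_000_000):
        print(f"{monthly:>13,} requests/month -> {monthly_to_rps(monthly):,.1f} requests/s")
```

The exact results (about 0.96, 38.6 and 386) are slightly below the rounded rules of thumb of 1, 40 and 400.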

Performance estimation

Memory

  • Random access: 300K times / s

  • Sequential access: 5M times / s

  • Throughput: on the order of GB per second

  • Reading 1 MB of data from memory takes about 0.25 ms

Disk IO

  • Operating system page size for read and write: 4KB

  • SATA mechanical hard disk

    • IOPS: 120 times / s

    • Sequential read size: 100MB / s

    • Random read size: 2MB / s

    • Sector size: 0.5KB

  • SSD: much faster than a mechanical disk, though still slower than memory

    • Access latency: 0.1-0.2 ms

    • Sector size: 4 KB
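
To get a feel for what these figures mean in practice, the sketch below estimates how long it takes to read the same amount of data from memory and from a mechanical disk. It uses only the numbers listed above (1 MB from memory in 0.25 ms, 100 MB/s sequential and 2 MB/s random on a SATA disk). The 1,000 MB payload is an illustrative assumption.

```python
# Throughput in MB/s derived from the figures in the notes.
THROUGHPUT_MB_PER_S = {
    "memory": 1000 / 0.25,    # 1 MB per 0.25 ms -> 4,000 MB/s
    "hdd_sequential": 100.0,  # SATA mechanical disk, sequential read
    "hdd_random": 2.0,        # SATA mechanical disk, random read
}


def read_time_seconds(size_mb: float, medium: str) -> float:
    """Estimated time to read size_mb megabytes from the given medium."""
    return size_mb / THROUGHPUT_MB_PER_S[medium]


if __name__ == "__main__":
    size_mb = 1000  # hypothetical payload, roughly 1 GB
    for medium in THROUGHPUT_MB_PER_S:
        print(f"{medium:>15}: {read_time_seconds(size_mb, medium):8.2f} s")
```

The gap between sequential and random reads on a mechanical disk (a factor of 50 with these numbers) is the reason log-structured and append-only designs are attractive on that hardware.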

Network latency

  • Single DC network round trip: 0.5ms

  • Multi DC network round trip: 30-100ms

  • RPC timeouts within a single DC are usually set to 500 ms

  • Interactive latency checker (a slider at the top selects the year): https://colin-scott.github.io/personal_website/research/interactive_latency.html

Typical API latency

  • [TODO: Add a section for typical API latency]

Load balancing design

  • Example: design a load balancing mechanism for an application with 10M DAU (e.g. GitHub has around 10M DAU)

  • Traffic volume estimation

    • 10M DAU, with each user performing 10 operations a day, gives an average of roughly 1,160 QPS (10M * 10 / 86,400 s)

    • Peak traffic at 10 times the average: ~11,600 QPS

    • Static resources and internal microservice calls multiply the request volume further. Assuming a factor of 10: ~116,000 QPS

  • Capacity planning

    • Multiple DCs: QPS * 2 = 232,000

    • Half-year volume increase: QPS * 1.5 = 348,000

  • Mechanism

    • No DNS layer

    • LVS
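
The estimation chain above is a sequence of multiplications, which the following sketch makes explicit. The factor names are illustrative, and the default values are the ones used in the example.

```python
SECONDS_PER_DAY = 86_400


def planned_qps(dau: int,
                ops_per_user_per_day: int,
                peak_factor: float = 10,    # peak vs. average traffic
                fanout_factor: float = 10,  # static resources, microservice calls
                dc_factor: float = 2,       # multiple data centers
                growth_factor: float = 1.5  # half-year volume increase
                ) -> float:
    """QPS the load balancing layer should be planned for."""
    average = dau * ops_per_user_per_day / SECONDS_PER_DAY
    return average * peak_factor * fanout_factor * dc_factor * growth_factor


if __name__ == "__main__":
    print(f"average QPS: {10_000_000 * 10 / SECONDS_PER_DAY:,.0f}")
    print(f"planned QPS: {planned_qps(10_000_000, 10):,.0f}")
```

The result is slightly below 348,000 because the text rounds the average up to 1,160 QPS before multiplying.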

Stress testing tools

  • mysqlslap: ships with MySQL. Cannot run long-duration stress tests.

  • Sysbench: works on macOS and Linux.

  • JMeter: only basic functionality for database stress testing.

Scale numbers with examples

Typeahead service

Google search

  • Google has been visited 62.19 billion times this year.

  • Google processes over 3.5 billion searches per day.

    • It means that Google processes over 40,000 search queries every second on average. Let’s also take a look at how Google’s searches per year have progressed. In 1998, Google was processing over 10,000 search queries per day. In comparison, by the end of 2006, the same amount of searches would be processed by Google in a single second.

  • 84 percent of respondents use Google 3+ times a day or more often.

    • Google has 92.18 percent of the market share as of July 2019.

  • More than one billion questions have been asked on Google Lens.

  • 63 percent of Google’s US organic search traffic originated from mobile devices.

  • Facebook was the most searched keyword on Google.

  • 46 percent of product searches begin on Google.

  • 90 percent of survey respondents said they were likely to click on the first set of results.

Instant messaging app

  • WhatsApp: 1.6 billion MAU

  • Facebook Messenger: 1.3 billion MAU

  • WeChat: 1.1 billion MAU

  • Snapchat: 0.3 billion MAU

  • Telegram: 0.2 billion MAU

Microsoft Teams

  • 140 million DAU

  • 240 million MAU

WhatsApp

  • 1.6 billion WhatsApp users access the app on a monthly basis. 53 percent of WhatsApp users in the US use the app at least once a day.

  • More than 65 billion messages are sent via WhatsApp every day. In other words, that boils down to 2.7 billion per hour, 45 million per minute, and more than 750,000 per second.

  • WhatsApp was downloaded 96 million times in February 2020.

  • WhatsApp is available in more than 180 countries and 60 different languages.

  • With 340 million users, India is WhatsApp’s biggest market.

  • There are more than five million businesses using WhatsApp Business.

Video Streaming

Netflix

  • 200 million subscribers Q4/2020. US has 74 million subscribers.

    • vs Amazon Prime - 150 million subscribers

    • vs Hulu - 39 million subscribers

  • Subscribers spent 3.2 hours per day watching Netflix

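// Watch-video load (assumes 100 M DAU and 2 hours watched per user per day)
// Average concurrent streams = 100 M * 2 hours * 3,600 s/hour / 86,400 s/day ≈ 8.3 M
// Watch RPS = average concurrent streams * requests issued per stream per second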

YouTube

  • 2.3 billion MAU

  • 720,000 hours of video uploaded daily

    • 500 hours of video uploaded every minute

    • (2012) 4 billion video views per day. 60 hours of video uploaded every minute. 350+ million devices are YouTube enabled.

    • 8.4 minutes per person per day if everyone watches YouTube

  • Second most popular search after Google

  • Localized in 100 countries and 80 languages

  • 70% of traffic comes from mobile

Newsfeed

Twitter

  • There are 330 million monthly active users and 145 million daily active users.

  • 500 million tweets are sent each day. That’s roughly 6,000 tweets every second.

  • A total of 1.3 billion accounts have been created.

  • Of those, 44% made an account and left before ever sending a tweet.

  • Based on US accounts, 10% of users write 80% of tweets.

  • During the 2014 FIFA World Cup Final, 618,725 tweets were sent in a single minute.

Facebook

Photo sharing

Instagram

  • Facebook photo statistics, for comparison:

    • 250 billion photos uploaded in total since 2004

    • 300 million photo uploads per day

    • 243,055 new photos uploaded per minute

    • 127 photos uploaded on average per Facebook user

  • There are 1.074 billion Instagram MAU worldwide in 2021.

  • Instagram users spend an average of 53 minutes per day.

  • Dec, 2012: more than 25 photos and 90 likes every second.

File system

Dropbox

  • Assume the application has 50 million signed-up users and 10 million DAU.

  • Users get 10 GB of free space.

  • Assume users upload 2 files per day. The average file size is 500 KB.

  • 1:1 read to write ratio.

  • Total space allocated: 50 million * 10 GB = 500 Petabyte

  • QPS for upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~ 240

  • Peak QPS = QPS * 2 = 480
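
The Dropbox estimate above can be reproduced in a few lines. The sketch uses the assumptions stated in the list and decimal units (1 PB = 10^6 GB). The daily ingest figure is derived here from the same assumptions and does not appear in the original list.

```python
SECONDS_PER_DAY = 24 * 3600

SIGNED_UP_USERS = 50_000_000
DAU = 10_000_000
FREE_SPACE_GB = 10
UPLOADS_PER_USER_PER_DAY = 2
AVERAGE_FILE_KB = 500
PEAK_FACTOR = 2

# Space that must be promised if every user fills the free quota.
total_space_pb = SIGNED_UP_USERS * FREE_SPACE_GB / 1_000_000

# Upload API load, average and peak.
upload_qps = DAU * UPLOADS_PER_USER_PER_DAY / SECONDS_PER_DAY
peak_upload_qps = upload_qps * PEAK_FACTOR

# New data written per day (1 TB = 10^9 KB).
daily_ingest_tb = DAU * UPLOADS_PER_USER_PER_DAY * AVERAGE_FILE_KB / 1_000_000_000

if __name__ == "__main__":
    print(f"total space allocated: {total_space_pb:,.0f} PB")
    print(f"upload QPS: {upload_qps:,.0f} average, {peak_upload_qps:,.0f} peak")
    print(f"daily ingest: {daily_ingest_tb:,.0f} TB")
```

The exact average is about 231 QPS; the list rounds it up to 240. With a 1:1 read to write ratio, the download API sees the same load.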

Geo location

Yelp

  • Yelp has more than 178 million unique visitors monthly across mobile, desktop and app platforms

Uber

  • 103 million MAU

  • Uber had 5 million drivers in Q4 2019 and averaged 18.7 million trips per day in Q1 2020

    • By comparison, Lyft has 2 million drivers, who serve over 21.2 million active riders per quarter

References

  • 分布式服务架构：原理、设计与实战 (Distributed Service Architecture: Principles, Design, and Practice)

  • Netflix serves 100% of its video, over 125 million hours every day, to 100 million members across the globe.

  • Netflix: for each episode of The Crown, over 1,200 files are created.

  • YouTube (2009): 1 billion views per day. That’s at least 11,574 views per second, 694,444 views per minute, and 41,666,667 views per hour.
Failure trends in a large disk drive population
https://www.brentozar.com/archive/2014/12/commodity-hardware/#:~:text=Commodity hardware refers to cheap,E5%2D2600 v3 CPU sockets
https://medium.com/sequoia-capital/selecting-the-right-user-metric-de95015aa38
https://colin-scott.github.io/personal_website/research/interactive_latency.html
https://www.oberlo.com/blog/google-search-statistics
https://everysecond.io/messenger
https://www.businessofapps.com/data/netflix-statistics/
https://netflixtechblog.com/how-data-science-helps-power-worldwide-delivery-of-netflix-content-bac55800f9a7
https://netflixtechblog.com/content-popularity-for-open-connect-b86d56f613b
https://everysecond.io/youtube
https://mashable.com/2009/10/09/youtube-billion-views/
https://www.oberlo.com/blog/youtube-statistics#:~:text=500 hours of video are,uploaded every day to YouTube
https://www.brandwatch.com/blog/twitter-stats-and-statistics/#:~:text=Twitter user statistics,billion accounts have been created.&text=As of Q1 2019%2C 68m,access the site via mobile
https://everysecond.io/instagram
https://www.statista.com/topics/1882/instagram/#:~:text=As of June 2018%2C the,market based on audience size
Dropbox statistics
https://review42.com/resources/yelp-statistics/
https://everysecond.io/uber
Availability numbers