Architecture tradeoff analysis

Review Rubrics

Soft skills

  • Requirements gathering

  • Make decisions and tradeoffs with justification

  • Describe the solution using concise language and accurate technical terms

Hard skills

  • Design quality: scalability, reliability, efficiency, etc. (L4/L5)

  • Basic facts about existing software and hardware capabilities (L4 partly, L5)

  • Project lifecycle awareness, e.g. how a project is developed and maintained (L5)

Non-functional requirements (NFRs)

| Type | Description |
|------|-------------|
| Performance | Efficiency, such as throughput and response time |
| Availability | Percentage of uptime in a year |
| Scalability | Service capacity increases linearly as the number of nodes increases |
| Extensibility | Pluggable; easy to add new functionality |
| Security | Privacy and security |
| Observability | Ability to detect problems and find the root cause quickly |
| Testability | Ease of testing the different components |
| Robustness | Fault tolerance and fast recovery; high robustness usually indicates high availability |
| Portability / Compatibility | Support for different operating systems, hardware, software (browsers, etc.) and versions |
| Consistency | Reads return the same, up-to-date data regardless of which node or replica serves them |

Availability

Availability percentage and service downtime
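
The relationship between an availability percentage and the downtime it allows can be sketched as follows. This is a minimal illustration, assuming a 365-day year; the function name is hypothetical and not taken from any library.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: ignores leap years


def downtime_per_year_seconds(availability_percent: float) -> float:
    """Allowed downtime per year, in seconds, for a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_per_year_seconds(pct) / 3600
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99% uptime allows about 87.6 hours of downtime a year, and each additional nine divides that budget by ten.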

Commodity hardware failure trend

  • If the critical path includes 4-5 systems and around 10 database servers, and each disk has an annual failure rate of about 2%, expect roughly two disk failures per year.
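
The failure estimate above can be sketched as a simple expected-value calculation. The disk count of 100 is an assumption made here for illustration (for example, 10 servers with 10 disks each); the source does not state how many disks are involved. Failures are also assumed to be independent.

```python
def expected_failures_per_year(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of disk failures in a year."""
    return disk_count * annual_failure_rate


def prob_at_least_one_failure(disk_count: int, annual_failure_rate: float) -> float:
    """Probability that at least one disk fails in a year (independent failures)."""
    return 1 - (1 - annual_failure_rate) ** disk_count


# Hypothetical fleet: 100 disks at a 2% annual failure rate.
print(expected_failures_per_year(100, 0.02))
print(prob_at_least_one_failure(100, 0.02))
```

With fewer disks the expected count drops proportionally, so the fleet size matters more than the exact failure rate when deciding how much redundancy to plan for.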

Decision chart

  • [TODO: Decision chart]

COGS

Commodity hardware

Capacity planning

1. Get a baseline: MAU and DAU

  • Stickiness is calculated as (DAU/MAU)*100. Industry benchmarks report both the average and the median stickiness, because medians are less likely to be skewed by outliers.

  • For the SaaS industry, the average stickiness is 13%, which means slightly less than 4 days of activity per user per month. The median for the SaaS industry is 9.4%, implying less than 3 days of activity per user per month.

  • Multiply DAU/WAU by WAU/MAU to get the actual DAU/MAU ratio:

    • Facebook: ~72%

    • Ecommerce:

      • Amazon: 17%

      • Walmart: 15%

      • eBay: 3%

    • Finance:

      • PayPal: 12.5%

      • Venmo: 10%

    • Uber: 12.5%

    • Netflix: 3%

    • Groupon: 4.5%
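
The stickiness arithmetic above can be sketched in a few lines. The function names are hypothetical, and a 30-day month is assumed when converting a ratio into active days.

```python
def stickiness_percent(dau: float, mau: float) -> float:
    """Stickiness as defined above: (DAU / MAU) * 100."""
    return dau / mau * 100


def active_days_per_month(stickiness_pct: float, days_in_month: int = 30) -> float:
    """Average number of active days per user per month (assumes a 30-day month)."""
    return stickiness_pct / 100 * days_in_month


def dau_mau_from_weekly(dau_wau: float, wau_mau: float) -> float:
    """Combine DAU/WAU and WAU/MAU into a DAU/MAU ratio."""
    return dau_wau * wau_mau


print(active_days_per_month(13))   # SaaS average from the text
print(active_days_per_month(9.4))  # SaaS median from the text
```

The two printed values come out slightly under 4 and slightly under 3 days, matching the figures quoted for the SaaS industry.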

2. Growth speed

  • For fast-growing data (e.g. order data on an e-commerce website), provision 2X the planned capacity to avoid resharding.

  • For slow-growing data (e.g. user identity data on an e-commerce website), provision the estimated capacity for the next 3 years to avoid resharding.

3. Divide capacity by system capability

Single Kafka instance

  • Single machine write: 250K messages per second (about 50 MB/s)

  • Single machine read: 550K messages per second (about 110 MB/s)
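
Dividing the required capacity by what a single instance can handle might look like the sketch below. The per-instance figures are the Kafka numbers listed above; the peak load of 1 million messages per second is a hypothetical input, and the headroom factor of 2 borrows the 2X rule for fast-growing data from step 2.

```python
import math

KAFKA_WRITE_MSGS_PER_SEC = 250_000  # single-machine write figure from the text
KAFKA_READ_MSGS_PER_SEC = 550_000   # single-machine read figure from the text


def instances_needed(peak_per_sec: float, per_instance_per_sec: float,
                     headroom: float = 1.0) -> int:
    """Instances required to serve the peak load, rounded up."""
    return math.ceil(peak_per_sec * headroom / per_instance_per_sec)


# Hypothetical workload: 1M messages/s at peak, 2X headroom on the write path.
print(instances_needed(1_000_000, KAFKA_WRITE_MSGS_PER_SEC, headroom=2.0))
print(instances_needed(1_000_000, KAFKA_READ_MSGS_PER_SEC))
```

The result is a lower bound only; replication, partition layout, and message size all move the real number.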

Appendix: Conversions

Power of two

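| Power of two | Base-10 approximation | Short name |
|--------------|-----------------------|------------|
| 10 | 1 thousand (10^3) | 1 KB |
| 20 | 1 million (10^6) | 1 MB |
| 30 | 1 billion (10^9) | 1 GB |
| 40 | 1 trillion (10^12) | 1 TB |
| 50 | 1 quadrillion (10^15) | 1 PB |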

Time scale conversion

  • Total seconds in a day: 86400 ~ 10^5

  • 2.5 million requests per month: 1 request per second

  • 100 million requests per month: 40 requests per second

  • 1 billion requests per month: 400 requests per second
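
The monthly-to-per-second rules of thumb above can be checked with a short sketch, assuming a 30-day month. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def per_month_to_per_second(requests_per_month: float, days_per_month: int = 30) -> float:
    """Average requests per second for a monthly request count."""
    return requests_per_month / (days_per_month * SECONDS_PER_DAY)


for monthly in (2_500_000, 100_000_000, 1_000_000_000):
    print(f"{monthly:,} per month is about {per_month_to_per_second(monthly):.1f} per second")
```

The exact values land a little under the rounded figures in the list (for example about 386 rather than 400 per second for 1 billion per month), which is why the list is best read as an upper-leaning approximation.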

Performance estimation

Memory

  • Random access: 300K times / s

  • Sequential access: 5M times / s

  • Throughput: on the order of GB per second

  • Reading 1 MB of data from memory takes about 0.25 ms

Disk IO

  • Operating system page size for read and write: 4KB

  • SATA mechanical hard disk

    • IOPS: 120 times / s

    • Sequential read size: 100MB / s

    • Random read size: 2MB / s

    • Sector size: 0.5KB

  • SSD: far faster than a mechanical disk, though still slower than memory

    • Access latency: 0.1-0.2 ms

    • Sector size: 4KB
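
To get a feel for the memory and mechanical-disk figures above, the sketch below estimates how long it takes to read 1 GB through each path. It assumes 1 GB = 1000 MB for round numbers and treats the listed throughputs as constant; the names are hypothetical.

```python
# Throughputs derived from the figures above, in MB per second.
MEMORY_SEQ_MB_PER_S = 1000 / 0.25  # 1 MB per 0.25 ms -> 4000 MB/s
HDD_SEQ_MB_PER_S = 100             # SATA mechanical disk, sequential read
HDD_RANDOM_MB_PER_S = 2            # SATA mechanical disk, random read


def read_time_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to read size_mb at a constant throughput."""
    return size_mb / throughput_mb_per_s


size_mb = 1000  # assumption: 1 GB treated as 1000 MB
for name, rate in [("memory", MEMORY_SEQ_MB_PER_S),
                   ("HDD sequential", HDD_SEQ_MB_PER_S),
                   ("HDD random", HDD_RANDOM_MB_PER_S)]:
    print(f"{name}: {read_time_seconds(size_mb, rate):.2f} s")
```

The gap between sequential and random reads on a mechanical disk (a factor of 50 with these numbers) is the main reason storage engines favor sequential access patterns.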

Network latency

Typical API latency

  • [TODO: Add a section for typical API latency]

Load balancing design

  • Example: Design a load-balancing mechanism for an application with 10M DAU (e.g. GitHub has around 10M DAU)

  • Traffic volume estimation

    • 10M DAU. Suppose each user performs 10 operations a day. The average is then roughly 1,160 QPS.

    • Peak traffic at 10 times the average: ~11,600 QPS

    • Static resources and calls between microservices multiply the volume further. Assuming a factor of 10: ~116,000 QPS.

  • Capacity planning

    • Multiple data centers: QPS * 2 = 232,000

    • Half-year volume increase: QPS * 1.5 = 348,000

  • Mechanism

    • No DNS layer

    • LVS (Linux Virtual Server)
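
The estimation chain in this example can be written out as a small sketch. All factors are the ones assumed in the list above (10 operations per user, 10X peak, 10X amplification, 2X for multiple data centers, 1.5X growth); the function and key names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_planned_qps(dau: int, ops_per_user_per_day: int,
                         peak_factor: float = 10,
                         amplification_factor: float = 10,
                         datacenter_factor: float = 2,
                         growth_factor: float = 1.5) -> dict:
    """Walk from DAU to the QPS the load balancer should be sized for."""
    average = dau * ops_per_user_per_day / SECONDS_PER_DAY
    peak = average * peak_factor
    amplified = peak * amplification_factor  # static resources, microservice calls
    multi_dc = amplified * datacenter_factor
    planned = multi_dc * growth_factor
    return {"average": average, "peak": peak, "amplified": amplified,
            "multi_dc": multi_dc, "planned": planned}


steps = estimate_planned_qps(dau=10_000_000, ops_per_user_per_day=10)
for name, qps in steps.items():
    print(f"{name}: {qps:,.0f} QPS")
```

The computed values are slightly lower than the figures in the list (about 1,157 average and about 347,000 planned) because the list rounds the average up to 1,160 before multiplying.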

Stress testing tools

  • mysqlslap: Ships with MySQL. Not suited to long-running stress tests.

  • Sysbench: Works on macOS and Linux.

  • JMeter: Only basic functionality for database pressure testing.

Scale numbers with examples

Typeahead service

  • Google has been visited 62.19 billion times this year.

  • Google processes over 3.5 billion searches per day.

    • This means that Google processes over 40,000 search queries every second on average. By comparison, in 1998 Google processed over 10,000 search queries per day; by the end of 2006, it processed that many in a single second.

  • 84 percent of respondents use Google three or more times a day.

    • Google has 92.18 percent of the market share as of July 2019.

  • More than one billion questions have been asked on Google Lens.

  • 63 percent of Google’s US organic search traffic originated from mobile devices.

  • Facebook was the most searched keyword on Google.

  • 46 percent of product searches begin on Google.

  • 90 percent of survey respondents said they were likely to click on the first set of results.

Instant messaging app

  • WhatsApp: 1.6 billion MAU

  • Facebook Messenger: 1.3 billion MAU

  • WeChat: 1.1 billion MAU

  • Snapchat: 0.3 billion MAU

  • Telegram: 0.2 billion MAU

Microsoft Teams

  • 140 million DAU

  • 240 million MAU

WhatsApp

  • 1.6 billion WhatsApp users access the app on a monthly basis. 53 percent of WhatsApp users in the US use the app at least once a day.

  • More than 65 billion messages are sent via WhatsApp every day. In other words, that boils down to 2.7 billion per hour, 45 million per minute, and more than 750,000 per second.

  • WhatsApp was downloaded 96 million times in February 2020.

  • WhatsApp is available in more than 180 countries and 60 different languages.

  • With 340 million users, India is WhatsApp’s biggest market.

  • There are more than five million businesses using WhatsApp Business.

Video Streaming

Netflix

// Watch video load (average concurrent streams)
100 M daily active users * 2 hours (7,200 s) watched per user per day / 86,400 seconds per day ≈ 8.3 M
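
One way to read the estimate above is as the average number of streams being watched at any moment: total viewing seconds per day divided by the seconds in a day. The sketch below follows that reading, which is an interpretation rather than something the source states; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_concurrent_streams(dau: int, hours_watched_per_user_per_day: float) -> float:
    """Average number of simultaneous streams, assuming viewing is spread evenly."""
    viewing_seconds_per_day = dau * hours_watched_per_user_per_day * 3600
    return viewing_seconds_per_day / SECONDS_PER_DAY


print(f"{average_concurrent_streams(100_000_000, 2):,.0f} concurrent streams on average")
```

Real traffic is concentrated in evening hours, so a peak multiplier would normally be applied on top of this average.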

YouTube

Newsfeed

Twitter

Facebook

Photo sharing

Instagram

File system

Dropbox

  • Assume the application has 50 million signed-up users and 10 million DAU.

  • Users get 10 GB of free space.

  • Assume users upload 2 files per day. The average file size is 500 KB.

  • 1:1 read to write ratio.

  • Total space allocated: 50 million * 10 GB = 500 PB

  • QPS for upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~ 240

  • Peak QPS = QPS * 2 = 480
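
The Dropbox estimates above can be reproduced with a short sketch. The inputs are the assumptions from the list; decimal units are assumed (1 PB = 10^6 GB, 1 TB = 10^9 KB), and the daily upload volume is an extra figure derived here from the same inputs. The names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600

signed_up_users = 50_000_000
dau = 10_000_000
free_space_gb = 10
uploads_per_user_per_day = 2
avg_file_size_kb = 500
peak_factor = 2

# Decimal units assumed: 1 PB = 1,000,000 GB and 1 TB = 1,000,000,000 KB.
total_space_pb = signed_up_users * free_space_gb / 1_000_000
upload_qps = dau * uploads_per_user_per_day / SECONDS_PER_DAY
peak_upload_qps = upload_qps * peak_factor
daily_upload_tb = dau * uploads_per_user_per_day * avg_file_size_kb / 1_000_000_000

print(f"Total space allocated: {total_space_pb:.0f} PB")
print(f"Upload QPS: {upload_qps:.0f} (peak {peak_upload_qps:.0f})")
print(f"Daily upload volume: {daily_upload_tb:.0f} TB")
```

The computed upload QPS is about 231, which the list rounds up to 240 before doubling for the peak.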

Geo location

Yelp

Uber

  • 103 million MAU

  • Uber had 5 million drivers as of Q4 2019 and averaged 18.7 million trips per day in Q1 2020

    • By comparison, Lyft has 2 million drivers, who serve over 21.2 million active riders per quarter

References

  • Distributed Service Architecture: Principles, Design, and Practice (分布式服务架构:原理、设计与实战)
