Estimation
Target
There are 1 billion websites to crawl, and the full set is re-crawled roughly once a month.
Each website has 100 links.
Need to retain data for 5 years.
Throughput estimation
// How many web pages to fetch per second
10^9 websites * 100 links per website / (4 weeks * 7 days * 86400 sec)
~= 10^11 / (2.4 * 10^6 sec)
~= 4 * 10^4 webpages/sec (round up to ~10^5 to leave headroom)
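For reference, the same arithmetic as a minimal Python sketch; the constants are just the assumptions above made explicit, and the variable names are illustrative only:

```python
# Back-of-the-envelope throughput check, using the assumptions above.
WEBSITES = 10**9                     # 1 billion websites
PAGES_PER_SITE = 100                 # ~100 links (pages) per website
SECONDS_PER_MONTH = 4 * 7 * 86400    # ~2.4 million seconds in a 4-week month

pages_per_month = WEBSITES * PAGES_PER_SITE             # 10^11 pages
pages_per_second = pages_per_month / SECONDS_PER_MONTH

print(f"{pages_per_second:,.0f} pages/sec")             # ~41,000 -> round up to ~10^5
```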
Storage estimation
Suppose each webpage needs to be stored for 5 years.
Page sizes vary a lot, but since we are dealing with HTML text only, let's assume an average page size of 100KB.
Total copies of data to store: 5, one full copy per year of the 5-year retention period.
Disks should only be filled to about 70% of capacity, so needed capacity = data size / 0.7.
10^9 websites * 100 links per website * 100KB * 5 copies / 0.7 capacity ratio
~= 10^9 * 10^2 * 10^5 * 5 / 0.7
~= 7 * 10^16 bytes
~= 70 Petabytes
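A similar sketch for the storage math, assuming one full copy of the data per retention year and disks filled to at most 70% of raw capacity:

```python
# Back-of-the-envelope storage check, using the assumptions above.
PAGES = 10**9 * 100                  # 10^11 pages in one full crawl
PAGE_SIZE_BYTES = 100 * 1000         # 100 KB of HTML per page
RETENTION_COPIES = 5                 # one full copy per year of the 5-year retention
CAPACITY_RATIO = 0.7                 # fill disks to at most 70% of raw capacity

data_size = PAGES * PAGE_SIZE_BYTES * RETENTION_COPIES   # 5 * 10^16 bytes
needed_capacity = data_size / CAPACITY_RATIO             # ~7.1 * 10^16 bytes

print(f"{needed_capacity / 1e15:.0f} PB")                # ~71 PB
```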
Needs to be distributed?
Suppose that a single machine can use 16 threads for web crawling and that crawling each web page takes 0.1s.
A single machine could then crawl up to 16 / 0.1 = 160 pages per second, far short of the tens of thousands of pages per second required above; even crawling just 2 billion pages would take 2 * 10^9 / (160 * 86400) ≈ 145 days.
So the crawler needs to be distributed.
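The single-machine feasibility check as a sketch, under the same 16-thread, 0.1s-per-page assumption:

```python
# Single-machine feasibility check, using the assumptions above.
THREADS = 16                 # crawl threads on one machine
SECONDS_PER_PAGE = 0.1       # time to fetch one page
PAGES_TO_CRAWL = 2 * 10**9   # the 2-billion-page figure used above

pages_per_second = THREADS / SECONDS_PER_PAGE               # 160 pages/sec per machine
days_needed = PAGES_TO_CRAWL / (pages_per_second * 86400)

print(f"{days_needed:.0f} days on a single machine")        # ~145 days
```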