Estimation
Target
There are 1 billion new pages per month.
Each website has 100 links.
Need to retain data for 5 years.
Throughput estimation
With 10^9 new pages per month, the crawler must fetch on average 10^9 / (30 * 86400) ≈ 386 pages per second; assuming peak load is roughly twice the average, plan for about 800 pages per second.
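The same arithmetic as a minimal runnable sketch; the 30-day month and the 2x peak factor are assumptions, not part of the stated requirements.

```python
# Rough crawl-rate estimate (assumes 30-day months and a 2x peak factor).
NEW_PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600

avg_rate = NEW_PAGES_PER_MONTH / SECONDS_PER_MONTH   # pages per second
peak_rate = 2 * avg_rate                             # assumed 2x peak

print(f"average: {avg_rate:.0f} pages/s")   # ~386
print(f"peak:    {peak_rate:.0f} pages/s")  # ~772
```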
Storage estimation
Suppose each webpage needs to be stored for 5 years.
Page sizes vary a lot, but since we are dealing with HTML text only, let's assume an average page size of 100 KB.
Total data to store over the 5-year retention period: 10^9 pages/month * 100 KB/page * 60 months ≈ 6 PB.
Assuming disks are kept at most 70% full, the capacity to provision is 6 PB / 0.7 ≈ 8.6 PB.
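A small sketch that reproduces the storage numbers above; the decimal units (1 KB = 1000 bytes) and the 70% utilization ceiling are the assumptions used in the calculation.

```python
# Storage estimate: 10^9 new pages/month at 100 KB each, retained for 5 years.
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_SIZE_BYTES = 100 * 1000       # 100 KB, decimal units for simplicity
RETENTION_MONTHS = 5 * 12
MAX_DISK_UTILIZATION = 0.7             # keep disks at most 70% full

raw_bytes = PAGES_PER_MONTH * AVG_PAGE_SIZE_BYTES * RETENTION_MONTHS
provisioned_bytes = raw_bytes / MAX_DISK_UTILIZATION

PB = 10 ** 15
print(f"raw data:    {raw_bytes / PB:.1f} PB")          # ~6.0 PB
print(f"provisioned: {provisioned_bytes / PB:.1f} PB")  # ~8.6 PB
```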
Does the crawler need to be distributed?
Suppose a single machine can run 16 crawler threads and fetching each web page takes 0.1 s.
A single machine can then crawl up to 16 / 0.1 = 160 pages per second, so crawling 2 * 10^9 pages would take 2 * 10^9 / (160 * 86400) ≈ 145 days.
So the crawler needs to be distributed.
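A minimal sketch of the same calculation, extended to show roughly how many such machines would be needed just to keep up with the stream of new pages; the 30-day month is an assumption, and real deployments need extra headroom for peaks, retries, and politeness delays.

```python
# Single-machine crawl time vs. machines needed to keep up with new pages.
THREADS_PER_MACHINE = 16
SECONDS_PER_PAGE = 0.1
PAGES_TO_CRAWL = 2 * 10 ** 9            # figure used in the estimate above
NEW_PAGES_PER_MONTH = 1_000_000_000

pages_per_sec = THREADS_PER_MACHINE / SECONDS_PER_PAGE          # 160 pages/s
single_machine_days = PAGES_TO_CRAWL / (pages_per_sec * 86400)  # ~145 days

required_rate = NEW_PAGES_PER_MONTH / (30 * 24 * 3600)          # ~386 pages/s
machines = required_rate / pages_per_sec                        # ~2.4

print(f"one machine: {single_machine_days:.0f} days")
print(f"machines to match new-page rate: {machines:.1f} (plus headroom)")
```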