Distributed crawler
Functional requirements
Crawler scope
Crawl a specific website, or the entire internet for use by a search engine?
Supported protocols
Which protocols do we need to support: HTTP/HTTPS/FTP?
Store HTML pages only, or also other media types such as images and videos? Do we need to store historical versions of webpages, or only the latest snapshot?
Ajax page crawling
Do we need to crawl dynamic pages whose content is loaded via Ajax, or are static pages enough?
Content loaded by Ajax is not visible in the page source, but it can be seen with the browser's element inspector. As a result, the crawler needs a customized way to access such content. Two common approaches (sketched below):
Manually mimic the underlying request by copying the relevant fields (cookie, user-agent, origin, etc.).
This approach is error-prone and requires significant expertise.
Use Selenium to click the load button in a real browser and crawl the rendered pages.
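A minimal Python sketch of the two approaches, assuming a hypothetical site whose load button fires an XHR to /api/articles; the endpoint, header values, and button id are placeholder assumptions, not part of the design:

```python
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

# Option 1: mimic the underlying Ajax request directly.
# Endpoint and header values are placeholders copied from the
# browser's network inspector for a hypothetical target site.
def fetch_ajax_page(page: int) -> dict:
    resp = requests.get(
        "https://example.com/api/articles",   # hypothetical XHR endpoint
        params={"page": page},
        headers={
            "Cookie": "session=PLACEHOLDER",
            "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)",
            "Origin": "https://example.com",
            "X-Requested-With": "XMLHttpRequest",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Option 2: drive a real browser with Selenium and let it execute the Ajax.
def fetch_with_selenium(url: str) -> str:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.find_element(By.ID, "load-more").click()   # hypothetical button id
        time.sleep(2)   # crude wait; production code would use WebDriverWait
        return driver.page_source
    finally:
        driver.quit()
```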
Auth manager
If the target website requires logging in, dedicated accounts are needed for crawling. When a large amount of content has to be crawled, hundreds or thousands of accounts must be managed, because a single account will hit rate limits.
Design a cookie pool (a minimal sketch follows this list); typical problems include:
How to detect that the pool size is not big enough?
How to manage different types of cookies for different websites?
How to know when a cookie has expired?
How to provide a unified way of configuration?
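A minimal in-memory cookie pool sketch; the class name, pool-size threshold, and TTL handling are illustrative assumptions, and a production pool would usually live in a shared store such as Redis so that all crawler workers see the same cookies:

```python
import random
import time
from collections import defaultdict
from typing import Optional

class CookiePool:
    """In-memory cookie pool keyed by website (sketch only)."""

    def __init__(self, min_size: int = 50):
        self.min_size = min_size          # per-site alert threshold
        self._pool = defaultdict(dict)    # site -> {account: (cookie, expires_at)}

    def add(self, site: str, account: str, cookie: str, ttl_seconds: int) -> None:
        self._pool[site][account] = (cookie, time.time() + ttl_seconds)

    def get(self, site: str) -> Optional[str]:
        """Return a random live cookie for the site, evicting expired ones."""
        now = time.time()
        entries = self._pool[site]
        for account in [a for a, (_, exp) in entries.items() if exp <= now]:
            del entries[account]          # expired: answers "how to know a cookie expired"
        if len(entries) < self.min_size:
            pass                          # pool running low: trigger re-login / add accounts
        if not entries:
            return None
        cookie, _ = random.choice(list(entries.values()))
        return cookie
```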
Verification code (CAPTCHA)
Do we need to handle verification codes?
Use an optical character recognition (OCR) mechanism for image-based codes (see the sketch below).
How to handle sliding verification codes?
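A minimal OCR sketch for image-based codes, assuming the pytesseract and Pillow libraries (an implementation choice, not mandated by the design). Sliding codes usually need a different approach: locating the gap in the background image and simulating a human-like drag with Selenium.

```python
from PIL import Image
import pytesseract

def solve_text_captcha(image_path: str) -> str:
    """Read the characters from a simple image-based verification code."""
    img = Image.open(image_path).convert("L")          # grayscale helps OCR
    img = img.point(lambda p: 0 if p < 140 else 255)   # crude binarization
    # --psm 7 tells Tesseract to treat the image as a single text line.
    return pytesseract.image_to_string(img, config="--psm 7").strip()
```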
Anti-crawler functionalities
Non-functional requirements
Efficiency
Prioritization: Crawl high-importance webpages first. Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.
Avoid duplication: avoid crawling webpages that have the same or extremely similar content (a deduplication sketch follows).
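A minimal deduplication sketch based on hashing normalized page content; the in-process set is an assumption for illustration. It catches exact duplicates only; near-duplicates (differing ads, timestamps, boilerplate) are usually caught with SimHash or MinHash, with fingerprints kept in a shared store.

```python
import hashlib

# In production this would be a shared store (Redis set, Bloom filter, etc.).
seen_fingerprints = set()

def is_duplicate(html: str) -> bool:
    """Exact-duplicate check via a hash of normalized page content."""
    normalized = " ".join(html.split()).lower()   # collapse whitespace, ignore case
    fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```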
Robustness
Avoid deadlocks: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.
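A minimal sketch of heuristic spider-trap guards (maximum depth, URL length, and pages per domain); the thresholds are illustrative assumptions to be tuned per deployment.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 16                 # illustrative limits
MAX_URL_LENGTH = 512
MAX_PAGES_PER_DOMAIN = 100_000

pages_per_domain = Counter()

def should_crawl(url: str, depth: int) -> bool:
    """Heuristic guards against spider traps (infinite calendars, looping paths)."""
    if depth > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return False
    domain = urlparse(url).netloc
    if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
        return False
    pages_per_domain[domain] += 1
    return True
```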
Scalability
The system should be able to crawl more content simply by adding machines (see the partitioning sketch below).
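A minimal sketch of partitioning URLs across crawler machines by hashing the host; the simple modulo scheme is an illustrative assumption, and a real deployment would typically use consistent hashing so that adding machines only remaps a small fraction of domains.

```python
import hashlib
from urllib.parse import urlparse

def assign_worker(url: str, num_workers: int) -> int:
    """Route a URL to a crawler worker by hashing its host.

    Hashing the host (not the full URL) keeps each domain on a single
    worker, which also makes per-domain politeness limits easy to enforce.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers
```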
Performance