Python Scrapy framework
Language comparison for crawlers:
Java: Too heavy and hard to refactor, while crawler code may need to change regularly
PHP: Poor support for asynchronous and multi-threaded code
C/C++: High development effort
Python: Winner. Rich in HTML parsers and HTTP request libraries, with modules such as Scrapy and Scrapy-Redis
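As a rough illustration of the Python point above, here is a minimal sketch of link extraction that uses only the standard library. The names `LinkExtractor` and `extract_links`, the sample HTML, and the base URL are all assumptions made for this example. Scrapy provides its own selectors and link extractors, so this is not how a Scrapy project would normally do it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page, used only to exercise the helper.
sample = '<p><a href="/wiki/Python">Python</a> <a href="https://scrapy.org/">Scrapy</a></p>'
links = extract_links(sample, "https://en.wikipedia.org/wiki/Main_Page")
```

A crawler would feed the links found this way back into its queue of pages to fetch.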
Middleware:
Extractor middleware (a Scrapy spider middleware): https://docs.scrapy.org/en/latest/topics/spider-middleware.html#topics-spider-middleware
https://leetcode.com/discuss/interview-question/124657/Design-a-distributed-web-crawler-that-will-crawl-all-the-pages-of-wikipedia/263401
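The spider middleware hook from the Scrapy documentation linked above can be sketched roughly as follows. A spider middleware is a plain class, and `process_spider_output(response, result, spider)` is the hook the linked page describes for post-processing what a spider callback yields. Check the signature against the Scrapy version in use. The class name `DomainFilterMiddleware`, its `allowed_domain` parameter, and the filtering rule are assumptions made for illustration.

```python
from urllib.parse import urlparse


class DomainFilterMiddleware:
    """Hypothetical spider middleware: drop requests that leave one domain."""

    def __init__(self, allowed_domain="wikipedia.org"):
        self.allowed_domain = allowed_domain

    def _is_allowed(self, url):
        host = urlparse(url).hostname or ""
        # Accept the domain itself and any of its subdomains.
        return host == self.allowed_domain or host.endswith("." + self.allowed_domain)

    def process_spider_output(self, response, result, spider):
        # `result` mixes scraped items and follow-up requests.
        for entry in result:
            url = getattr(entry, "url", None)
            if url is None:
                # No `url` attribute: treat it as an item and pass it through.
                yield entry
            elif self._is_allowed(url):
                yield entry
            # Requests to other domains are silently dropped.
```

In a Scrapy project, a class like this would be enabled through the `SPIDER_MIDDLEWARES` setting, for example `{"myproject.middlewares.DomainFilterMiddleware": 543}`, where the module path and the priority value are placeholders.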