Robustness
The scheduler and downloader are offline systems and can be restarted if needed.
However, they still need to recover from failures such as timeouts or parsing errors, and each type of exception needs to be handled separately.
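As a rough illustration of handling different exception types, here is a minimal sketch. It assumes timeouts are transient and worth retrying, while parse failures are not. The names `crawl_with_recovery`, `ParseError`, `fetch`, and `parse` are hypothetical and do not come from the original design.

```python
import time


class ParseError(Exception):
    """Hypothetical exception raised when a fetched page cannot be parsed."""


def crawl_with_recovery(fetch, parse, url, max_retries=3, backoff_seconds=0.0):
    """Fetch and parse one URL, treating each failure type differently.

    Returns (status, result), where status is "ok", "timeout", or "parse_error".
    """
    for attempt in range(1, max_retries + 1):
        try:
            body = fetch(url)
        except TimeoutError:
            # Assumption: timeouts are transient, so wait a little and retry.
            time.sleep(backoff_seconds * attempt)
            continue
        try:
            return "ok", parse(body)
        except ParseError:
            # Assumption: retrying will not fix a malformed page, so give up.
            return "parse_error", None
    # Every attempt timed out.
    return "timeout", None
```

A caller could use the returned status to decide whether to reschedule the task later or mark it as failed.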
In the standalone case, the scheduler is essentially an in-memory priority queue.
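A minimal sketch of such an in-memory queue, built on Python's `heapq`, might look like this. The class name `InMemoryScheduler` is hypothetical. Ordering by available time first, with priority as a tie-breaker, is an assumption rather than something the design specifies.

```python
import heapq
import itertools


class InMemoryScheduler:
    """Toy scheduler: tasks ordered by available time, then by priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal entries keep insertion order.
        self._counter = itertools.count()

    def add(self, url, available_at, priority=0):
        # heapq is a min-heap, so negate priority to serve higher values first.
        entry = (available_at, -priority, next(self._counter), url)
        heapq.heappush(self._heap, entry)

    def pop_due(self, now):
        """Return the next URL whose available time has passed, or None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[3]
        return None
```

Because the heap root always holds the earliest available time, a single comparison against `now` is enough to tell whether any task is due.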
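If the scheduler queue grows too large to fit in memory, a MySQL task table could be used instead, with fields such as:
state (working/idle): whether the task is currently being crawled.
priority (1/0):
available time: the earliest time the page should be fetched next, derived from its crawl frequency.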
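| id | state   | priority | available time      |
|----|---------|----------|---------------------|
| 1  | idle    | 1        | 2016-03-04 11:00 am |
| 2  | working | 1        | 2016-03-04 12:00 am |
| 3  | idle    | 0        | 2016-03-14 02:00 pm |
| 4  | idle    | 2        | 2016-03-12 04:25 am |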
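To make the task table concrete, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The column names, the `id` key, the helper names `create_task_table` and `claim_next_task`, and the rule that a larger priority value is served first are all assumptions for illustration. Timestamps are stored as 24-hour `YYYY-MM-DD HH:MM` strings so that plain string comparison orders them correctly.

```python
import sqlite3


def create_task_table(conn):
    # Columns mirror the fields described above; "id" is an assumed key.
    conn.execute(
        """CREATE TABLE task (
               id INTEGER PRIMARY KEY,
               state TEXT NOT NULL DEFAULT 'idle',
               priority INTEGER NOT NULL DEFAULT 0,
               available_time TEXT NOT NULL
           )"""
    )


def claim_next_task(conn, now):
    """Mark the best idle, due task as working and return its id (or None)."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            """SELECT id FROM task
               WHERE state = 'idle' AND available_time <= ?
               ORDER BY priority DESC, available_time ASC
               LIMIT 1""",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE task SET state = 'working' WHERE id = ?", (row[0],))
        return row[0]
```

With the four sample rows loaded, calling `claim_next_task(conn, "2016-03-05 00:00")` would pick task 1, since task 2 is already working and tasks 3 and 4 are not yet due. With several concurrent workers on a real MySQL deployment, the select-then-update step would likely need row locking (for example a locking read) to avoid two workers claiming the same task.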
From the previous analysis, the write throughput is around 800 RPS. A single scheduler machine will be enough.