Robustness

No need to be highly available

  • The scheduler and downloader are offline systems and can be restarted if needed.

  • However, they need to recover from failures such as timeouts or parsing errors. Each type of exception needs its own handling.
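One way to handle the different failure types is sketched below. This is a minimal illustration, not a definitive implementation: the exception classes `FetchTimeout` and `ParseError`, the function `crawl_with_recovery`, and the `fetch` and `parse` callables are all hypothetical names introduced for this example. It assumes that timeouts are transient and worth retrying, while a parsing failure on the same page will not improve with a retry.

```python
import time


class FetchTimeout(Exception):
    """Hypothetical error: the download did not finish in time."""


class ParseError(Exception):
    """Hypothetical error: the page was fetched but could not be parsed."""


def crawl_with_recovery(url, fetch, parse, max_retries=3, backoff_seconds=0.0):
    """Return ("ok", result), ("skipped", None), or ("retry_later", None)."""
    for attempt in range(1, max_retries + 1):
        try:
            page = fetch(url)
        except FetchTimeout:
            # Assumed transient: wait a little longer each time, then retry.
            time.sleep(backoff_seconds * attempt)
            continue
        try:
            return "ok", parse(page)
        except ParseError:
            # Re-parsing the same bytes will not help, so skip this URL.
            return "skipped", None
    # Every attempt timed out: hand the URL back to the scheduler.
    return "retry_later", None
```

The three return states let the caller decide what to write back to the scheduler: mark the task done, drop it, or make it available again at a later time.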

Output the crawling status to disk for recovery

  • In the standalone case, the scheduler is essentially an in-memory priority queue.

  • If the scheduler queue grows too big, a MySQL task table could be used instead.

    • state (working/idle): whether the URL is currently being crawled.

    • priority (1/0):

    • available time: when to fetch the URL next. This controls the crawl frequency.

    id   url   state       priority   available_time
    1          “idle”      1          “2016-03-04 11:00 am”
    2          “working”   1          “2016-03-04 12:00 am”
    3          “idle”      0          “2016-03-14 02:00 pm”
    4          “idle”      2          “2016-03-12 04:25 am”
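The task table and its recovery role can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so the example is self-contained, stores `available_time` as ISO 8601 text, and assumes a larger `priority` value means more urgent, which the notes do not specify. The function names `open_tasks`, `recover`, and `claim_next` are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS task (
    id             INTEGER PRIMARY KEY,
    url            TEXT NOT NULL,
    state          TEXT NOT NULL DEFAULT 'idle',   -- 'idle' or 'working'
    priority       INTEGER NOT NULL DEFAULT 0,     -- assumed: larger is more urgent
    available_time TEXT NOT NULL                   -- ISO 8601, so it sorts as text
)
"""


def open_tasks(path):
    """Open (or create) the on-disk task table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def recover(conn):
    """After a restart, release tasks that were in flight when the process died."""
    with conn:
        cur = conn.execute("UPDATE task SET state = 'idle' WHERE state = 'working'")
    return cur.rowcount


def claim_next(conn, now):
    """Mark the most urgent due task as working and return (id, url), or None."""
    with conn:
        row = conn.execute(
            "SELECT id, url FROM task "
            "WHERE state = 'idle' AND available_time <= ? "
            "ORDER BY priority DESC, available_time ASC LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE task SET state = 'working' WHERE id = ?", (row[0],))
    return row
```

Because the state lives on disk, a restarted scheduler can call `recover` once and resume from where it stopped. The select-then-update in `claim_next` is only safe with a single scheduler process, which matches the single-machine setup in these notes.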

Number of machines

  • From the previous analysis, the write throughput is around 800 RPS. A single scheduler machine will be enough.
