Kafka

History

Idea: Use Linux Cronjob to periodically upload files to HDFS
Downside: No failure tolerance. If a machine has failures, its log will be lost.

Idea: Once log is being written to HDFS, it will has three copies and no lost.
Downsides:
- Concurrent write pressure:
  - Lots of concurrency pressure on HDFS. If there are 100 application servers, then there will be 100 clients writing data to HDFS 24*7.
  - There will be concurrent write competition on HDFS chunck server because it will be queued in chunkserver.
- Overhead:
  - If each server writes their own log file, then there will be large number of small log files created.
  - For HDFS, each block will be at least 64MB and lots of small files will waste huge amount of storage space.
  - For MapReduce, each independent file need a separate map task to read.
In summary, HDFS is only applicable for scenarios where there is a single client writing sequentially in large chunks.

Idea:
- On each server, there will be a log collector. Multiple log collectors could upload their log to log aggregator
- There will be a tree like hierarchy where in the end only few log aggregators are writting data to HDFS.
- It could solve the problem that there will be large number of small files.
Downsides:
- The final aggregator will still produce a new HDFS file on a frequent basis.
- MapReduce tasks cannot assume that final aggregator data is already in place and need to handle the failure scenarios.

Last updated 1 year ago

Was this helpful?