Kafka

History

Goal

  • Hadoop MapReduce could only process data already stored on the HDFS cluster.

  • Goal: upload logs from application servers to the HDFS cluster.

Approach V0: Cronjob upload

  • Idea: use a Linux cron job to periodically upload log files to HDFS (sketch below).

  • Downside: no fault tolerance. If a machine fails before the next upload runs, the logs still sitting on its local disk are lost.
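
A minimal sketch of the V0 approach, assuming a small Java program that cron invokes periodically (the namenode address and local log directory are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.net.URI;

// Invoked by cron on every application server, e.g. once per hour.
// If the machine dies before the next run, the log files still sitting
// on its local disk are lost -- the downside noted above.
public class HourlyLogUploader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration()); // hypothetical address

        File[] rotatedLogs = new File("/var/log/myapp/rotated").listFiles(); // hypothetical local path
        if (rotatedLogs == null) return;

        for (File log : rotatedLogs) {
            // Copy each completed local log file into HDFS, one file per server per interval.
            fs.copyFromLocalFile(new Path(log.getAbsolutePath()),
                                 new Path("/logs/" + log.getName()));
            log.delete(); // until the next run, this local copy was the only copy
        }
    }
}
```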

Approach V1: Write via HDFS client

  • Idea: have each application server write its log directly to HDFS through an HDFS client; once a record is written to HDFS it has three replicas and cannot be lost (see the sketch after this list).

  • Downsides:

    • Concurrent write pressure:

      • Heavy concurrent write load on HDFS: with 100 application servers, there are 100 clients writing data to HDFS 24*7.

      • Concurrent writes also contend on each HDFS chunkserver (DataNode), because incoming writes are queued on the chunkserver.

    • Overhead:

      • If each server writes its own log file, a large number of small log files is created.

      • For HDFS, the default block size is 64 MB; a huge number of files far smaller than a block wastes block metadata and overwhelms the NameNode, which keeps every file's and block's metadata in memory.

      • For MapReduce, each independent file needs its own map task to read it, so many small files translate into many tiny, inefficient map tasks.

  • In summary, HDFS is only suitable for scenarios where a single client writes sequentially in large chunks.
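
A rough sketch of the V1 idea, assuming every application server keeps its own HDFS client open and streams its log straight into its own file (class name, namenode address, and paths are made up for illustration). With 100 servers doing this 24*7, the cluster sees 100 concurrent writers and an ever-growing pile of small per-server files:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

// Each application server runs one of these, writing continuously.
public class DirectHdfsLogWriter {
    private final FSDataOutputStream out;

    DirectHdfsLogWriter(String serverId) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration()); // hypothetical address
        // One file per server (typically also per rotation interval) ->
        // many files far smaller than the 64 MB block size.
        out = fs.create(new Path("/logs/" + serverId + "/access.log"));
    }

    // Called for every request the server handles. The data is replicated
    // across DataNodes so it is not lost, but all servers compete for the
    // cluster's write capacity.
    void log(String line) throws Exception {
        out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    void close() throws Exception {
        out.close();
    }
}
```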

Approach V2: Log collector such as Scribe/Flume

  • Idea:

    • On each server there is a log collector; multiple collectors forward their logs to a log aggregator.

    • The collectors and aggregators form a tree-like hierarchy, so in the end only a few log aggregators are writing data to HDFS.

    • This solves the small-files problem, since many servers' logs are merged into a few large files.

  • Downsides:

    • The final aggregator still has to roll over to a new HDFS file on a frequent basis.

    • MapReduce jobs cannot assume that the final aggregator's data is already in place and must handle failure scenarios (late or missing data) themselves.

Approach V3: Kafka to rescue

  • Architecture
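
A minimal sketch of the producer side of the Kafka pipeline, assuming application servers publish each log line to an "app-logs" topic instead of touching HDFS at all (topic name and broker addresses are illustrative). A separate consumer job, for example an HDFS sink connector, can then read the topic and write a small number of large, well-batched files to HDFS:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

// Application servers publish log lines to Kafka; the brokers absorb the
// write concurrency and buffer the data durably until downstream consumers
// batch it into large HDFS files.
public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by server id so one server's log lines stay in order within a partition.
            producer.send(new ProducerRecord<>("app-logs", "app-server-42", "GET /index.html 200"));
        }
    }
}
```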

References
