Kafka
History
Goal
Hadoop MapReduce could only process data already stored in the HDFS cluster.
Upload logs from application servers to the HDFS cluster so they can be processed.
Approach V0: Cronjob upload
Idea: Use a Linux cronjob to periodically upload log files to HDFS
Downside: No fault tolerance. If a machine fails before the next upload, the logs still on it are lost.
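A minimal sketch of the cron approach, assuming a hypothetical local log directory, a hypothetical HDFS target path, and the standard `hdfs dfs -put` CLI installed on each application server; the crontab entry in the comment is likewise illustrative.

```python
#!/usr/bin/env python3
# Cron-driven uploader (sketch). Example crontab entry (hypothetical path):
#   0 * * * * /usr/local/bin/upload_logs.py
import glob
import os
import subprocess

LOCAL_LOG_DIR = "/var/log/myapp"   # hypothetical local log directory
HDFS_TARGET_DIR = "/logs/myapp"    # hypothetical HDFS destination directory

def upload_rotated_logs():
    # Upload only rotated (closed) files; the live log is still being written.
    for path in glob.glob(os.path.join(LOCAL_LOG_DIR, "*.log.1")):
        # If this machine dies before the next cron run, everything written
        # since the last upload is lost -- the downside noted above.
        subprocess.run(
            ["hdfs", "dfs", "-put", "-f", path, HDFS_TARGET_DIR],
            check=True,
        )

if __name__ == "__main__":
    upload_rotated_logs()
```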
Approach V1: Write via HDFS client
Idea: Once a log is written to HDFS, it has three replicas and will not be lost.
Downsides:
Concurrent write pressure:
Heavy concurrency pressure on HDFS: with 100 application servers, there are 100 clients writing data to HDFS 24*7.
Concurrent writes compete at each HDFS chunk server, because incoming writes are queued on the chunk server.
Overhead:
If each server writes its own log file, a large number of small log files is created.
For HDFS, each block is at least 64MB, so lots of small files waste a huge amount of storage space.
For MapReduce, each independent file needs a separate map task to read it.
In summary, HDFS is only suited to scenarios where a single client writes sequentially in large chunks; the per-server writer sketched below violates both assumptions.
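To make the downsides concrete, here is a sketch of what every application server would run under V1. It assumes the `hdfs` PyPI package (a WebHDFS client), a WebHDFS endpoint at a hypothetical `namenode:9870`, and append support enabled on the cluster; the hostname, user, and paths are illustrative.

```python
import socket
import time

from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Hypothetical namenode address and user.
client = InsecureClient("http://namenode:9870", user="loguser")

# One file per application server -> many small files across the cluster.
hdfs_path = f"/logs/myapp/{socket.gethostname()}.log"

# Create this server's log file once (overwriting any previous one).
client.write(hdfs_path, data="# log start\n", overwrite=True, encoding="utf-8")

while True:
    line = f"{time.time()} some log line\n"
    # Every server appends small records 24*7; each append is queued at the
    # data (chunk) node and competes with appends from all other servers.
    client.write(hdfs_path, data=line, append=True, encoding="utf-8")
    time.sleep(1)
```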
Approach V2: Log collector such as Scribe/Flume
Idea:
On each server, there is a log collector. Multiple log collectors upload their logs to a log aggregator.
This forms a tree-like hierarchy in which, at the end, only a few log aggregators are writing data to HDFS.
It solves the problem of a large number of small files (a toy aggregator is sketched after the downsides below).
Downsides:
The final aggregator will still produce a new HDFS file on a frequent basis.
MapReduce jobs cannot assume that the final aggregated data is already in place and need to handle failure scenarios.
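The sketch below is a toy stand-in for the aggregator tier, not actual Scribe/Flume code: it accepts log lines from many collectors over TCP, buffers them, and periodically rolls one large file into HDFS with the `hdfs dfs -put` CLI. The port, paths, and roll interval are hypothetical.

```python
import socketserver
import subprocess
import tempfile
import threading
import time

ROLL_INTERVAL_S = 600                  # hypothetical: roll a big file every 10 min
HDFS_OUTPUT_DIR = "/logs/aggregated"   # hypothetical HDFS destination

_lock = threading.Lock()
_buffer = []
_last_roll = time.time()

def roll_to_hdfs():
    """Flush lines buffered from many servers as one large HDFS file."""
    global _buffer, _last_roll
    with _lock:
        lines, _buffer = _buffer, []
        _last_roll = time.time()
    if not lines:
        return
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
        f.writelines(lines)
        local_path = f.name
    hdfs_path = f"{HDFS_OUTPUT_DIR}/{int(time.time())}.log"
    # Files still appear on a periodic basis, so downstream MapReduce jobs
    # cannot assume that all data for a time window is already in place.
    subprocess.run(["hdfs", "dfs", "-put", local_path, hdfs_path], check=True)

class CollectorHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Each connection is a downstream log collector streaming lines.
        for raw in self.rfile:
            with _lock:
                _buffer.append(raw.decode("utf-8", errors="replace"))
                due = time.time() - _last_roll > ROLL_INTERVAL_S
            if due:
                roll_to_hdfs()

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("0.0.0.0", 9880), CollectorHandler) as srv:
        srv.serve_forever()
```

A real collector tier such as Scribe or Flume also buffers to local disk and handles failover, which this toy omits; the rolling-file behavior is what produces the remaining downsides above.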
Approach V3: Kafka to rescue
Architecture
References