Kafka
History
Goal
- Hadoop MapReduce could only process data already stored on the HDFS cluster. 
- The goal: upload logs from the application servers to the HDFS cluster. 
Approach V0: Cronjob upload
- Idea: Use a Linux cron job to periodically upload log files to HDFS (a minimal uploader is sketched below). 
- Downside: No fault tolerance. If a machine fails before the next upload, the logs accumulated since the last upload are lost. 
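To make the V0 idea concrete, here is a minimal sketch of the kind of uploader a cron entry could invoke, written against the Hadoop FileSystem API. The NameNode URI, log paths, and server id are made-up placeholders, not part of the original notes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

/**
 * Hypothetical one-shot uploader that a cron entry (e.g. run hourly) could invoke.
 * It copies the most recently rotated local log file into HDFS.
 */
public class CronLogUploader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with the real cluster URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path local = new Path("/var/log/app/access.log.1"); // last rotated log file
        Path remote = new Path("/logs/app-server-01/access-" + System.currentTimeMillis() + ".log");

        // One-shot copy; if the machine dies before the next cron run, the
        // not-yet-uploaded portion of the log is simply lost (the V0 downside).
        fs.copyFromLocalFile(/* delSrc = */ false, /* overwrite = */ true, local, remote);
        fs.close();
    }
}
```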
Approach V1: Write via HDFS client
- Idea: Once a log entry is written to HDFS, it is replicated three times, so it will not be lost. 
- Downsides: 
  - Concurrent write pressure: 
    - Heavy concurrency pressure on HDFS: with 100 application servers, there are 100 clients writing data to HDFS 24*7. 
    - Concurrent writes contend on the HDFS chunk servers (DataNodes), where requests get queued. 
  - Overhead: 
    - If each server writes its own log file, a large number of small log files is created. 
    - HDFS uses a large block size (64 MB by default), so a huge number of small files is wasteful and overloads the NameNode's metadata. 
    - For MapReduce, each independent file needs a separate map task to read it. 
- In summary, HDFS is only suitable for scenarios where a single client writes sequentially in large chunks (see the sketch below for why per-server writes break this assumption). 
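Below is a hedged sketch of what the V1 pattern looks like in code: every application server holds its own HDFS client and streams log lines into its own file. The class name, paths, and rotation scheme are illustrative assumptions; the point is that N servers means N concurrent writers and N small files per interval.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical V1 writer: each application server writes log lines
 * directly to HDFS through its own client connection.
 */
public class DirectHdfsLogWriter implements AutoCloseable {
    private final FileSystem fs;
    private final FSDataOutputStream out;

    public DirectHdfsLogWriter(String hdfsUri, String serverId) throws Exception {
        fs = FileSystem.get(URI.create(hdfsUri), new Configuration());
        // One file per server per interval -> many files far smaller than a 64 MB block.
        Path path = new Path("/logs/" + serverId + "/" + System.currentTimeMillis() + ".log");
        out = fs.create(path);
    }

    public void log(String line) throws Exception {
        out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
        out.hflush(); // push to the DataNode pipeline; every flush competes with other writers
    }

    @Override
    public void close() throws Exception {
        out.close();
        fs.close();
    }
}
```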
Approach V2: Log collector such as Scribe/Flume
- Idea: 
  - Each server runs a log collector. Multiple collectors forward their logs to a log aggregator. 
  - The collectors form a tree-like hierarchy in which, at the root, only a few aggregators write data to HDFS. 
  - This solves the problem of producing a large number of small files. 
- Downsides: 
  - The final aggregator still has to roll over and produce new HDFS files frequently. 
  - MapReduce jobs cannot assume the aggregated data is already in place and must handle failure scenarios themselves. 
 
Approach V3: Kafka to the rescue
- Architecture: Application servers act as producers that publish log records to Kafka topics; a small number of consumers read from Kafka and write large, batched files to HDFS. 
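As a minimal sketch of the producer side, the following Java class uses the standard Kafka producer client. The broker address and the topic name ("app-logs") are assumed placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

/**
 * Hypothetical V3 producer: instead of touching HDFS at all, each
 * application server appends its log lines to a Kafka topic.
 */
public class KafkaLogProducer implements AutoCloseable {
    private final KafkaProducer<String, String> producer;

    public KafkaLogProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all: the record is replicated before being acknowledged,
        // which provides the durability that V0 lacked.
        props.put("acks", "all");
        producer = new KafkaProducer<>(props);
    }

    public void log(String serverId, String line) {
        // Keyed by server id so one server's lines stay ordered within a partition.
        producer.send(new ProducerRecord<>("app-logs", serverId, line));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```

On the consumption side, a separate process (for example a Kafka consumer job or an HDFS sink connector) can batch many records together and write a few large, sequential files to HDFS, which is exactly the access pattern HDFS is designed for.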
