Beam架构
History
MapReduce
Initial effort for a fault tolerant system for large data processing such as Google URL visiting, inverted index
Cons:
All intermediate results of Map and Reduce need to be persisted on disk and are time-consuming.
Whether the problem could be solved in memory in a much more efficient way.
FlumeJava and Millwheel
Improvements:
Abstract all data into structure such as PCollection.
Abstract four primitive operations:
parallelDo / groupByKey / combineValues and flatten
Uses deferred evaluation to form a DAG and optimize the planning.
Cons:
FlumeJava only supports batch processing
Millwheel only supports stream processing
Dataflow and Cloud Dataflow
Improvements:
A unifid model for batch and stream processing
Use a set of standardized API to process data
Cons:
Only run on top of Google cloud
Apache Beam (Batch + Streaming)
Improvements:
Become a full open source platform
Apache beam support different runners such as Spark/Flink/etc.
Example arch with Amazon best sellers
Using beam API
Serving strategy for best sellers
Dedicated database
Save topK hot selling data in a separate database.
Cons:
When serving queries, need to join with primary database table.
Save back to original database with products
Have a separate column for hot selling products
Cons:
Need to update large amounts of databse records after each update.
Update best sellers per hour
Run a cron job according to the frequency.
Product states
Handle returned products by consumers
For each order, there should be an attribute "isSuccessfulSale()" specifying its state (e.g. sold, returned, etc).
Some best sellers which has been delisted
Similar to the above, there should be attribute "isInStock()"
Same products
Duplicated products
For each product, there is a product_id. And correspondingly, there will be a pipeline creating product_unique_id from products info such as description, image, etc.
A product receives bad rating. Seller delists and lists them again.
Similar to the above
Generate best sellers per category
Categorize products according to their tags
Last updated