Beam architecture
MapReduce was an initial effort at a fault-tolerant system for large-scale data processing, such as counting Google URL visits or building an inverted index.
Cons:
All intermediate results of Map and Reduce must be persisted to disk, which is time-consuming.
It leaves open whether the same problem could be solved much more efficiently in memory.
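The inverted-index example mentioned above can be sketched as a map, shuffle, and reduce sequence. This is a minimal in-memory illustration, not Google's implementation: the function names and the sample documents are assumptions made for the sketch, and a real MapReduce job would write the output of each phase to disk, which is the cost noted above.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (word, doc_id) pairs for one document."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Collapse the doc ids for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def inverted_index(docs):
    # Everything stays in memory here; this is only an illustration
    # of the data flow, not of the fault-tolerance machinery.
    mapped = [pair for doc_id, text in docs.items()
              for pair in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())

# Hypothetical sample documents.
docs = {"d1": "beam unifies batch and stream", "d2": "batch jobs run on beam"}
index = inverted_index(docs)
```

Each word in `index` maps to the list of documents that contain it, so a lookup such as `index["stream"]` returns the ids of the matching documents.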
Improvements:
Abstract all data into structures such as PCollection.
Abstract four primitive operations: parallelDo, groupByKey, combineValues, and flatten.
Use deferred evaluation to build a DAG and optimize the execution plan.
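To make deferred evaluation concrete, here is a toy sketch in which each of the four primitives only records a node in a DAG, and nothing executes until `run()` is called. `ToyPCollection` and its method names are hypothetical and are not the FlumeJava API; a real planner would also optimize the graph before running it, which this sketch skips.

```python
from collections import defaultdict
from functools import reduce

class ToyPCollection:
    """Hypothetical stand-in for a PCollection: one node in a deferred DAG."""

    def __init__(self, op, parents=(), fn=None, data=None):
        self.op, self.parents, self.fn, self.data = op, tuple(parents), fn, data

    @classmethod
    def source(cls, items):
        return cls("source", data=list(items))

    # Each primitive only records a node; no data is processed yet.
    def parallel_do(self, fn):
        return ToyPCollection("parallelDo", [self], fn=fn)

    def group_by_key(self):
        return ToyPCollection("groupByKey", [self])

    def combine_values(self, fn):
        return ToyPCollection("combineValues", [self], fn=fn)

    def flatten(self, *others):
        return ToyPCollection("flatten", [self, *others])

    def run(self):
        """Walk the DAG and evaluate it (no plan optimization in this toy)."""
        inputs = [parent.run() for parent in self.parents]
        if self.op == "source":
            return list(self.data)
        if self.op == "parallelDo":
            # fn returns zero or more outputs per input element.
            return [out for x in inputs[0] for out in self.fn(x)]
        if self.op == "groupByKey":
            groups = defaultdict(list)
            for key, value in inputs[0]:
                groups[key].append(value)
            return list(groups.items())
        if self.op == "combineValues":
            return [(key, reduce(self.fn, values)) for key, values in inputs[0]]
        if self.op == "flatten":
            return [x for part in inputs for x in part]
        raise ValueError(f"unknown op: {self.op}")

# Build the plan first; only the final run() touches the data.
logs_a = ToyPCollection.source(["a.com", "b.com"])
logs_b = ToyPCollection.source(["a.com"])
plan = (logs_a.flatten(logs_b)
        .parallel_do(lambda url: [(url, 1)])
        .group_by_key()
        .combine_values(lambda x, y: x + y))
counts = dict(plan.run())
```

Because `plan` is just a graph of nodes until `run()` is called, a planner has the whole pipeline available to rewrite, for example by fusing adjacent steps.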
Cons:
FlumeJava supports only batch processing.
MillWheel supports only stream processing.
Improvements:
A unified model for batch and stream processing.
Use a set of standardized APIs to process data.
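The idea of one API for both modes can be illustrated with plain Python generators: the same transform chain is applied to a bounded list (batch) and to an unbounded generator (stream). This is only a sketch of the concept under assumed names (`parse`, `large_orders`, `pipeline`, and the record format are all hypothetical); it is not the Dataflow or Beam API.

```python
from itertools import count, islice

def parse(records):
    """Element-wise transform: split 'user,amount' into a tuple."""
    for line in records:
        user, amount = line.split(",")
        yield user, int(amount)

def large_orders(pairs, threshold=100):
    """Element-wise filter: keep orders at or above the threshold."""
    for user, amount in pairs:
        if amount >= threshold:
            yield user, amount

def pipeline(source):
    # The same transform chain, whatever kind of source is supplied.
    return large_orders(parse(source))

# Bounded (batch) input: a finite list.
batch_result = list(pipeline(["ann,50", "bob,150", "cy,300"]))

# Unbounded (stream) input: an endless generator of hypothetical orders.
def endless_orders():
    for i in count(1):
        yield f"user{i},{i * 60}"

# Take only the first two outputs, since the stream never ends.
stream_result = list(islice(pipeline(endless_orders()), 2))
```

The sketch covers element-wise transforms only. Aggregations over an unbounded source additionally need some way to bound the data, such as windowing, which is left out here.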
Cons:
Runs only on Google Cloud.
Improvements:
Became a fully open-source platform.
Apache Beam supports different runners, such as Spark and Flink.
Save the top-K hot-selling products in a separate database.
Cons:
Serving a query requires a join with the primary database table.
Add a separate column for hot-selling products.
Cons:
A large number of database records must be updated after each ranking update.
Run a cron job at the required refresh frequency.
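The computation that such a cron job would run each period can be sketched as a count followed by a top-K selection. The order fields (`product_id`, `quantity`) and the function name are assumptions made for illustration; writing the result to the separate database or column is left out.

```python
import heapq
from collections import Counter

def top_k_products(orders, k):
    """Return the k product ids with the most units sold, highest first."""
    units = Counter()
    for order in orders:
        units[order["product_id"]] += order["quantity"]
    # nlargest avoids sorting the full catalogue when k is small.
    return heapq.nlargest(k, units.items(), key=lambda item: item[1])

# Hypothetical orders collected since the last run.
orders = [
    {"product_id": "p1", "quantity": 3},
    {"product_id": "p2", "quantity": 5},
    {"product_id": "p1", "quantity": 4},
    {"product_id": "p3", "quantity": 1},
]
hot = top_k_products(orders, k=2)
```

Here `hot` holds `(product_id, units)` pairs, ready to be stored wherever the serving path reads the hot-selling list from.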
Handle products returned by consumers
Each order should have an attribute "isSuccessfulSale()" specifying its state (e.g. sold, returned, etc.).
Best sellers that have been delisted
Similar to the above, each product should have an attribute "isInStock()".
Duplicate products
Each product has a product_id. Correspondingly, a pipeline creates a product_unique_id from product information such as the description, image, etc.
A product receives bad ratings, and the seller delists it and lists it again.
Similar to the above
Categorize products according to their tags
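The edge cases above can be folded into the counting step, as in the following sketch. The `is_successful_sale` and `is_in_stock` fields mirror the "isSuccessfulSale()" and "isInStock()" attributes from the notes. The hashing scheme for `product_unique_id` is purely an assumption: the notes only say it is derived from product information such as the description and image.

```python
import hashlib
from collections import Counter

def product_unique_id(product):
    """Hypothetical dedup key: hash of the normalized description and image."""
    normalized = " ".join(product["description"].lower().split())
    raw = f"{normalized}|{product['image']}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]

def count_valid_sales(orders, products):
    """Count units per unique product, skipping returns and delisted items."""
    units = Counter()
    for order in orders:
        product = products[order["product_id"]]
        if not order["is_successful_sale"]:  # e.g. the order was returned
            continue
        if not product["is_in_stock"]:       # e.g. the product was delisted
            continue
        units[product_unique_id(product)] += order["quantity"]
    return units

# Hypothetical catalogue: p2 is a relisted duplicate of p1, p3 is delisted.
products = {
    "p1": {"description": "Red  Mug", "image": "mug.png", "is_in_stock": True},
    "p2": {"description": "red mug", "image": "mug.png", "is_in_stock": True},
    "p3": {"description": "Blue Cup", "image": "cup.png", "is_in_stock": False},
}
orders = [
    {"product_id": "p1", "quantity": 2, "is_successful_sale": True},
    {"product_id": "p2", "quantity": 3, "is_successful_sale": True},
    {"product_id": "p2", "quantity": 1, "is_successful_sale": False},
    {"product_id": "p3", "quantity": 9, "is_successful_sale": True},
]
units = count_valid_sales(orders, products)
```

Keying the counter by `product_unique_id` rather than `product_id` means a delisted and relisted product keeps accumulating under one entry, while returned orders and out-of-stock products never reach the ranking.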