Beam Architecture


History

MapReduce

  • Initial effort at a fault-tolerant system for large-scale data processing, e.g. counting Google URL visits and building inverted indexes.

  • Cons:

    • All intermediate results of Map and Reduce have to be persisted to disk, which is time-consuming.

    • This raised the question of whether the problem could be solved much more efficiently in memory.

FlumeJava and MillWheel

  • Improvements:

    • Abstracts all data into a structure called PCollection.

    • Abstracts four primitive operations (see the Beam-style sketch after this list):

      • parallelDo, groupByKey, combineValues and flatten

    • Uses deferred evaluation to build a DAG of operations and optimize the execution plan.

  • Cons:

    • FlumeJava only supports batch processing

    • MillWheel only supports stream processing
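
FlumeJava itself was never released outside Google, but its four primitives survive almost unchanged in Beam's Java SDK (parallelDo ≈ ParDo, groupByKey ≈ GroupByKey, combineValues ≈ Combine.groupedValues, flatten ≈ Flatten). The following is only a rough correspondence sketch, assuming "lines" and "otherCounts" are existing PCollections:

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// parallelDo -> ParDo: element-wise processing
PCollection<KV<String, Long>> pairs =
    lines.apply(ParDo.of(new DoFn<String, KV<String, Long>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(KV.of(c.element(), 1L));
      }
    }));

// groupByKey -> GroupByKey; combineValues -> Combine.groupedValues
PCollection<KV<String, Long>> counts =
    pairs.apply(GroupByKey.create())
         .apply(Combine.groupedValues(Sum.ofLongs()));

// flatten -> Flatten: merge several PCollections into one
PCollection<KV<String, Long>> merged =
    PCollectionList.of(counts).and(otherCounts).apply(Flatten.pCollections());

// Nothing runs yet: like FlumeJava, Beam defers evaluation and builds a DAG
// that is optimized and executed only when the pipeline is actually run.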

Dataflow and Cloud Dataflow

  • Improvements:

    • A unified model for batch and stream processing

    • Uses a set of standardized APIs to process data

  • Cons:

    • Only runs on top of Google Cloud

Apache Beam (Batch + Streaming)

  • Improvements:

    • Became a fully open-source platform

    • Apache Beam supports different runners such as Spark, Flink, etc.

Example architecture with Amazon best sellers

Using the Beam API

// Count how many times each product id occurs in the sales records
PCollection<KV<String, Long>> salesCount = salesRecords.apply(Count.perElement());

// Keep the K products with the highest sales count. Top.of emits the K largest
// elements (per the comparator) as a single List; the comparator must be Serializable.
PCollection<List<KV<String, Long>>> topK =
      salesCount.apply(Top.of(K, new SerializableComparator<KV<String, Long>>() {
          @Override
          public int compare(KV<String, Long> a, KV<String, Long> b) {
            return Long.compare(a.getValue(), b.getValue());
          }
      }));
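
To put the snippet in context, a minimal end-to-end sketch is shown below. The input path, the output prefix and the assumption that each input line is one sold product id are illustrative only:

import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

// Hypothetical input: one sold product id per line, exported from the order system
PCollection<String> salesRecords = pipeline.apply(TextIO.read().from("sales/*.txt"));

// ... salesCount and topK computed exactly as in the snippet above ...

// Format the single top-K list as text and write it out for the serving layer
topK.apply(MapElements.into(TypeDescriptors.strings())
                      .via((List<KV<String, Long>> bestSellers) -> bestSellers.toString()))
    .apply(TextIO.write().to("best-sellers"));

pipeline.run().waitUntilFinish();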

Serving strategy for best sellers

Dedicated database

  • Save the top-K hot-selling products in a separate, dedicated database.

  • Cons:

    • When serving queries, the best-seller data still needs to be joined with the primary product table (illustrated below).
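
For illustration only, a serving-time query might look like the JDBC sketch below; "connection" is an existing java.sql.Connection, and the best_sellers / products schemas are assumptions rather than part of the original design:

import java.sql.PreparedStatement;
import java.sql.ResultSet;

// The dedicated best_sellers table only stores product ids and counts,
// so product details still come from the primary products table via a join.
String sql =
    "SELECT p.id, p.title, p.price, b.sales_count "
  + "FROM best_sellers b JOIN products p ON p.id = b.product_id "
  + "ORDER BY b.sales_count DESC LIMIT ?";
try (PreparedStatement stmt = connection.prepareStatement(sql)) {
  stmt.setInt(1, 10);                                  // top 10 for the landing page
  try (ResultSet rs = stmt.executeQuery()) {
    while (rs.next()) {
      // render: rs.getString("title"), rs.getLong("sales_count"), ...
    }
  }
}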

Save back to original database with products

  • Add a separate column on the products table that flags hot-selling products.

  • Cons:

    • A large number of product records have to be updated after each refresh (sketched below).
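
A hedged sketch of such a refresh, assuming a hypothetical best_seller_rank column and a bestSellers list (product id plus count) coming from the Beam job; the first statement is what makes this approach expensive at scale:

import java.sql.PreparedStatement;
import java.sql.Statement;

// Clear the previous flags (touches every previously flagged row) ...
try (Statement stmt = connection.createStatement()) {
  stmt.executeUpdate(
      "UPDATE products SET best_seller_rank = NULL WHERE best_seller_rank IS NOT NULL");
}

// ... then mark the new top-K products with their rank
try (PreparedStatement stmt =
         connection.prepareStatement("UPDATE products SET best_seller_rank = ? WHERE id = ?")) {
  for (int rank = 0; rank < bestSellers.size(); rank++) {
    stmt.setInt(1, rank + 1);
    stmt.setString(2, bestSellers.get(rank).getKey());  // product id from the pipeline output
    stmt.addBatch();
  }
  stmt.executeBatch();
}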

Update best sellers per hour

  • Run a cron job at the desired refresh frequency, e.g. once per hour (an in-process alternative is sketched below).
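
If no external scheduler (cron, Airflow, etc.) is available, a minimal in-process alternative is the JDK scheduler below; runBestSellerPipeline() is a placeholder for launching the Beam job:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
// Recompute the best sellers once per hour; fixed delay keeps runs from overlapping
scheduler.scheduleWithFixedDelay(() -> runBestSellerPipeline(), 0, 1, TimeUnit.HOURS);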

Product states

  • Handle products returned by consumers

    • For each order, there should be an attribute such as "isSuccessfulSale()" specifying its state (e.g. sold, returned, etc).

  • Some best sellers may already have been delisted

    • Similar to the above, there should be an attribute such as "isInStock()" on the product (see the filter sketch after this list).
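
A possible way to apply these state checks, assuming a hypothetical Order class with isSuccessfulSale(), getProduct().isInStock() and getProductId() accessors and an existing orders PCollection; in Beam this is just a Filter in front of the counting step:

import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Keep only orders that still count as a sale of a purchasable product
PCollection<String> salesRecords =
    orders
        .apply(Filter.by((Order order) ->
            order.isSuccessfulSale() && order.getProduct().isInStock()))
        .apply(MapElements.into(TypeDescriptors.strings())
                          .via((Order order) -> order.getProductId()));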

Same products

  • Duplicated products

    • Each product has a product_id. Correspondingly, a separate pipeline derives a product_unique_id from product info such as description, images, etc., so that duplicate listings can be merged (see the re-keying sketch after this list).

  • A product receives bad ratings, so the seller delists it and lists it again (under a new product_id).

    • Similar to the above
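
A rough sketch of the re-keying step, assuming productIdToUniqueId is a PCollection<KV<String, String>> mapping product_id to product_unique_id produced by that dedup pipeline; counting then happens on the unique id, so duplicate listings roll up together:

import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Broadcast the id mapping as a side input
PCollectionView<Map<String, String>> uniqueIdMap = productIdToUniqueId.apply(View.asMap());

// Replace each sold product_id with its product_unique_id before Count.perElement()
PCollection<String> dedupedSales =
    salesRecords.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Map<String, String> mapping = c.sideInput(uniqueIdMap);
        c.output(mapping.getOrDefault(c.element(), c.element()));
      }
    }).withSideInputs(uniqueIdMap));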

Generate best sellers per category

  • Categorize products according to their tags, then compute a top-K list per category (see the sketch below).
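
A minimal sketch of the per-category variant, assuming categorizedSales is a PCollection<KV<String, String>> of (category, productId) pairs, one per successful sale, derived from the product tags; Top.perKey then keeps the K most-sold products inside each category:

import java.util.List;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SerializableComparator;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Count sales per (category, productId), then re-key by category alone
PCollection<KV<String, KV<String, Long>>> countsPerCategory =
    categorizedSales
        .apply(Count.perElement())          // KV<KV<category, productId>, Long>
        .apply(MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(),
                TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs())))
            .via((KV<KV<String, String>, Long> e) ->
                KV.of(e.getKey().getKey(), KV.of(e.getKey().getValue(), e.getValue()))));

// Keep the K best-selling products inside each category
PCollection<KV<String, List<KV<String, Long>>>> topKPerCategory =
    countsPerCategory.apply(Top.perKey(K,
        new SerializableComparator<KV<String, Long>>() {
          @Override
          public int compare(KV<String, Long> a, KV<String, Long> b) {
            return Long.compare(a.getValue(), b.getValue());
          }
        }));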
