Beam架构

History

MapReduce

  • Initial effort for a fault tolerant system for large data processing such as Google URL visiting, inverted index

  • Cons:

    • All intermediate results of Map and Reduce need to be persisted on disk and are time-consuming.

    • Whether the problem could be solved in memory in a much more efficient way.

FlumeJava and Millwheel

  • Improvements:

    • Abstract all data into structure such as PCollection.

    • Abstract four primitive operations:

      • parallelDo / groupByKey / combineValues and flatten

    • Uses deferred evaluation to form a DAG and optimize the planning.

  • Cons:

    • FlumeJava only supports batch processing

    • Millwheel only supports stream processing

Dataflow and Cloud Dataflow

  • Improvements:

    • A unifid model for batch and stream processing

    • Use a set of standardized API to process data

  • Cons:

    • Only run on top of Google cloud

Apache Beam (Batch + Streaming)

  • Improvements:

    • Become a full open source platform

    • Apache beam support different runners such as Spark/Flink/etc.

Example arch with Amazon best sellers

Using beam API

// Count frequency of selling
salesCount = salesRecords.apply(Count.perElement())

// Count the top K elements
PCollection<KV<String, Long>> topK =
      salesCount.apply(Top.of(K, new Comparator<KV<String, Long>>() {
          @Override
          public int compare(KV<String, Long> a, KV<String, Long> b) {
            return b.getValue.compareTo(a.getValue());
          }
      }));

Serving strategy for best sellers

Dedicated database

  • Save topK hot selling data in a separate database.

  • Cons:

    • When serving queries, need to join with primary database table.

Save back to original database with products

  • Have a separate column for hot selling products

  • Cons:

    • Need to update large amounts of databse records after each update.

Update best sellers per hour

  • Run a cron job according to the frequency.

Product states

  • Handle returned products by consumers

    • For each order, there should be an attribute "isSuccessfulSale()" specifying its state (e.g. sold, returned, etc).

  • Some best sellers which has been delisted

    • Similar to the above, there should be attribute "isInStock()"

Same products

  • Duplicated products

    • For each product, there is a product_id. And correspondingly, there will be a pipeline creating product_unique_id from products info such as description, image, etc.

  • A product receives bad rating. Seller delists and lists them again.

    • Similar to the above

Generate best sellers per category

  • Categorize products according to their tags

Last updated