Lambda architecture

SuperWebAnalytics.com requiements

Functional requirements

  • Pageview counts by URL sliced by time—Example queries are “What are the pageviews for each day over the past year?” and “How many pageviews have there been in the past 12 hours?”

  • Unique visitors by URL sliced by time—Example queries are “How many unique people visited this domain in 2010?” and “How many unique people visited this domain each hour for the past three days?”

  • Bounce-rate analysis—“What percentage of people visit the page without visiting any other pages on this website?”

NonFunctional requirements

  • Real time metrics

Overview

References

  • Book "Big Data: Principles and best practices of scalable and real-time data systems". Nathan Marz, James Warren

Batch

Data model

Storage requirements

DFS (Distributed file systems)

Partition

Pail on top of DFS

Recomputation vs incremental algorithm

Workflow overview

Time bucket

Flowchart

Url normalization

User id normalization

PageView query

Unique visitors

Bounce rate

Lambda speed

Requirements

  • Random reads—A realtime view should support fast random reads to answer queries quickly. This means the data it contains must be indexed.

  • Random writes—To support incremental algorithms, it must also be possible to modify a realtime view with low latency.

  • Scalability—As with the serving layer views, the realtime views should scale with the amount of data they store and the read/write rates required by the application. Typically this implies that realtime views can be distributed across many machines.

  • Fault tolerance—If a disk or a machine crashes, a realtime view should continue to function normally. Fault tolerance is accomplished by replicating data across machines so there are backups should a single machine fail.

Asynchronous design

Page view

Lambda serving

Def

  • The serving layer consists of databases that index and serve the results of the batch layer

Requirements

  • Batch writable—The batch views for a serving layer are produced from scratch. When a new version of a view becomes available, it must be possible to completely swap out the older version with the updated view.

  • Scalable—A serving layer database must be capable of handling views of arbitrary size. As with the distributed filesystems and batch computation framework previously discussed, this requires it to be distributed across multiple machines.

  • Random reads—A serving layer database must support random reads, with indexes providing direct access to small portions of the view. This requirement is necessary to have low latency on queries.

  • Fault-tolerant—Because a serving layer database is distributed, it must be tolerant of machine failures.

Index design

Pageview

Unique visitors

Bounce rate

Last updated