Typeahead

Scenario

  • Google suggestion

    • Prefix -> top n hot key words

    • DAU: 500M

    • Search: 6_6_500M = 18b (Every one search for 6 words, each word has 6 characters)

    • QPS = 18b / 86400 ~ 200k

    • Peak QPS = QPS * 2 ~ 400k

  • Twitter typeahead

Initial design

  • Query service

    • Each time a user types a character, the entire prefix is sent to query service.

  • Data collection service

Storage

Query service DB

Word count table

  • How to query on the db

  • Query SQL: Select * from hit_stats where keyword like ${key}% order by hitCount DESC Limit 10

    • Like operation is expensive. It is a range query.

    • where keyword like 'abc%' is equivalent to where keyword >= 'abc' AND keyword < 'abd'

keyword
hitCount

Amazon

20b

Apple

15b

Adidas

7b

Airbnb

3b

Prefix table

  • Convert a keyword table to a prefix table, put into memory

prefix
keywords

a

"amazon","apple"

am

"amazon","amc"

ad

"adidas","adobe"

don

"don't have", "donald trump"

Trie

  • Trie ( in memory ) + Serialized Trie ( on disk ).

    • Trie is must faster than DB because

      • All in-memory vs DB cache miss

  • Store word count at node, but it's slow

    • e.g. TopK. Always need to traverse the entire trie. Exponential complexity.

  • Instead, we can store the top n hot key words and their frequencies at each node, search becomes O(len).

prefix
keywords

a

"amazon","apple"

am

"amazon","amc"

ad

"adidas","adobe"

don

"don't have", "donald trump"

  • How do we add a new record {abd: 3b} to the trie

    • Insert the record into all nodes along its path in the trie.

    • If a node along the path is already full, then need to loop through all records inside the node and compared with the node to be inserted.

Data collections service

  • How frequently do you aggregate data

    • Real-time not impractical. Read QPS 200K + Write QPS 200K. Will slow down query service.

    • Once per week. Each week data collection service will fetch all the data within the most recent one week and aggregate them.

  • How does data collection service update query service? Offline update and works online.

    • All in-memory trie must have already been serialized. Read QPS already really high. Do not write to in-memory trie directly.

    • Use another machine. Data collection service updates query service.

Scale

How to reduce response time

  • Cache result

    • Front-end browser cache the results

  • Pre-fetch

    • Fetch the latest 1000 results

What if the trie too large for one machine

  • Use consistent hashing to decide which machine a particular string belongs to.

    • A record can exist only in one machine. Sharding according to char will not distribute the resource evenly. Instead, calculate consistent hashing code

    • a, am, ama, amax stored in different machines.

How to reduce the size of log file

  • Probablistic logging.

    • Too slow to calculate and too large amount of data to store.

    • Log with 1/10,000 probability

      • Say over the past two weeks "amazon" was searched 1 billion times, with 1/1000 probability we will only log 1 million times.

      • For a term that's searched 1000 times, we might end up logging only once or even zero times.

Real world

Last updated