Typeahead
Scenario
Google suggestion
Prefix -> top n hot key words
DAU: 500M
Search: 6_6_500M = 18b (Every one search for 6 words, each word has 6 characters)
QPS = 18b / 86400 ~ 200k
Peak QPS = QPS * 2 ~ 400k
Twitter typeahead
Initial design
Query service
Each time a user types a character, the entire prefix is sent to query service.
Data collection service
Storage
Query service DB
Word count table
How to query on the db
Query SQL: Select * from hit_stats where keyword like ${key}% order by hitCount DESC Limit 10
Like operation is expensive. It is a range query.
where keyword like 'abc%' is equivalent to where keyword >= 'abc' AND keyword < 'abd'
keyword | hitCount |
---|---|
Amazon | 20b |
Apple | 15b |
Adidas | 7b |
Airbnb | 3b |
Prefix table
Convert a keyword table to a prefix table, put into memory
prefix | keywords |
---|---|
a | "amazon","apple" |
am | "amazon","amc" |
ad | "adidas","adobe" |
don | "don't have", "donald trump" |
Trie
Trie ( in memory ) + Serialized Trie ( on disk ).
Trie is must faster than DB because
All in-memory vs DB cache miss
Store word count at node, but it's slow
e.g. TopK. Always need to traverse the entire trie. Exponential complexity.
Instead, we can store the top n hot key words and their frequencies at each node, search becomes O(len).
prefix | keywords |
---|---|
a | "amazon","apple" |
am | "amazon","amc" |
ad | "adidas","adobe" |
don | "don't have", "donald trump" |
How do we add a new record {abd: 3b} to the trie
Insert the record into all nodes along its path in the trie.
If a node along the path is already full, then need to loop through all records inside the node and compared with the node to be inserted.
Data collections service
How frequently do you aggregate data
Real-time not impractical. Read QPS 200K + Write QPS 200K. Will slow down query service.
Once per week. Each week data collection service will fetch all the data within the most recent one week and aggregate them.
How does data collection service update query service? Offline update and works online.
All in-memory trie must have already been serialized. Read QPS already really high. Do not write to in-memory trie directly.
Use another machine. Data collection service updates query service.
Scale
How to reduce response time
Cache result
Front-end browser cache the results
Pre-fetch
Fetch the latest 1000 results
What if the trie too large for one machine
Use consistent hashing to decide which machine a particular string belongs to.
A record can exist only in one machine. Sharding according to char will not distribute the resource evenly. Instead, calculate consistent hashing code
a, am, ama, amax stored in different machines.
How to reduce the size of log file
Probablistic logging.
Too slow to calculate and too large amount of data to store.
Log with 1/10,000 probability
Say over the past two weeks "amazon" was searched 1 billion times, with 1/1000 probability we will only log 1 million times.
For a term that's searched 1000 times, we might end up logging only once or even zero times.
Real world
Last updated