Flowchart
Naive short url web service
Offline job generates keys (Daemon process) and stores into database.
Offline job can tolerate longer query time.
Database schema
create table keys(Key text primary key, status integer)
status
0: available
1: occupied
Query
Keys database query
Select key from keys where status = 0 limit 1
Update keys set status = 1 where key = a_key
Cons
Race condition
Multiple users can get same keys if qps is high
Select / Update is not atomic operation
Database is not suitable for queuing
Solution:
Database lock: Select key from keys where status = 0 limit 1 for update skip locked
Distributed lock: bad performance.
Redis -> LPush and LPop: Redis list data structure is actually queue and it is thread-safe and production ready
Improved approach
Flowchart
HDFS
After short url is generated, it is stored inside HDFS files consecutively.
1 billion urls per month means 24 billion records in two years and in total 24 * 10 ^ 9 * 6 = 144 GB.
It is stored as ASC code consecutively inside the file.
Short url preload service
Steps
When preload service starts, it needs to preload 10K urls from HDFS.
Then it just need to read 60K bytes data from HDFS, and store the offset value of 60K.
Typically this process will take 20-50ms.
Next time it loads another 60K bytes data, starting from the offset value.
The loaded 10K short urls will be stored as linkedlist inside Url short
Upon request, the url shortener service will get a new entry from the preload service.
This step has lock and it could be resource exclusive.
When there is fewer than 2000 new urls inside url preload service, it will load another 10K short urls from HDFS.
Linkedlists vs circular arrays
Using linkedlist has the following cons:
Much additional GC cost
Additional space cost, shortUrl only takes 6 bytes.
Benefits
Since opening the file and reading from offset is a mutally exclusive operation, it will prevent multiple short url services load simultaneously from HDFS.
Last updated