Flowchart

Naive short url web service

Offline job generates keys (Daemon process) and stores into database.
- Offline job can tolerate longer query time.

Keys database query
- 1. Select key from keys where status = 0 limit 1
- 1. Update keys set status = 1 where key = a_key

Race condition

Multiple users can get same keys if qps is high
- Select / Update is not atomic operation
Database is not suitable for queuing
Solution:
1. Database lock: Select key from keys where status = 0 limit 1 for update skip locked
2. Distributed lock: bad performance.
3. Redis -> LPush and LPop: Redis list data structure is actually queue and it is thread-safe and production ready

After short url is generated, it is stored inside HDFS files consecutively.
1 billion urls per month means 24 billion records in two years and in total 24 * 10 ^ 9 * 6 = 144 GB.
It is stored as ASC code consecutively inside the file.

When preload service starts, it needs to preload 10K urls from HDFS.
Then it just need to read 60K bytes data from HDFS, and store the offset value of 60K.
- Typically this process will take 20-50ms.
Next time it loads another 60K bytes data, starting from the offset value.
The loaded 10K short urls will be stored as linkedlist inside Url short
Upon request, the url shortener service will get a new entry from the preload service.
- This step has lock and it could be resource exclusive.
When there is fewer than 2000 new urls inside url preload service, it will load another 10K short urls from HDFS.

Using linkedlist has the following cons:
- Much additional GC cost
- Additional space cost, shortUrl only takes 6 bytes.

Since opening the file and reading from offset is a mutally exclusive operation, it will prevent multiple short url services load simultaneously from HDFS.

Last updated 1 year ago

Was this helpful?