Shortening mechanisms
Retention period varies with the base radix. For example, assume 500 million new records per month (6 billion per year):
- Length 8: 62^8 ≈ 218 trillion keys ≈ 36,000 years
- Length 7: 62^7 ≈ 3.5 trillion keys ≈ 590 years
- Length 6: 62^6 ≈ 57 billion keys ≈ 9.5 years
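The same arithmetic can be checked with a few lines of Python; this is just a sanity-check sketch, with the 500 million records/month figure taken as the assumption stated above.

```python
# Rough capacity check: how long does each key length last
# at 500 million new records per month (6 billion per year)?
RECORDS_PER_YEAR = 500_000_000 * 12

for length in (6, 7, 8):
    keyspace = 62 ** length                     # base62 alphabet: [0-9a-zA-Z]
    years = keyspace / RECORDS_PER_YEAR
    print(f"length={length}: {keyspace:.2e} keys ~ {years:,.0f} years")
```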
Required encoding length for a smaller workload of 36,500,000 new records per year:

| Usable characters | Records per year | Encoding length needed |
| ----------------- | ---------------- | ---------------------- |
| [0-9]             | 36,500,000       | 8                      |
| [0-9a-zA-Z]       | 36,500,000       | 5                      |
| [0-9a-zA-Z+/=]    | 36,500,000       | 5                      |
Six characters would be enough purely for storage purposes; however, to avoid too many hash collisions, we could use seven characters.
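The lengths in the table follow from a simple calculation: the minimum length is the ceiling of log_base(records). A minimal sketch, assuming the 36,500,000 records/year figure from the table:

```python
import math

# Minimum key length needed to address a given number of records,
# for each candidate alphabet (36,500,000/year is the table's assumption).
RECORDS_PER_YEAR = 36_500_000

for chars, base in (("[0-9]", 10), ("[0-9a-zA-Z]", 62), ("[0-9a-zA-Z+/=]", 64)):
    length = math.ceil(math.log(RECORDS_PER_YEAR, base))
    print(f"{chars:<16} base {base:>2} -> length {length}")
```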
Comparison
- Crypto hash functions (MD5 and SHA-1): secure but slow.
- Fast hash functions (Murmur and Jenkins): better performance; 32-, 64-, and 128-bit variants are available.
Crypto hash functions (MD5, SHA-1)
Cons
- Key duplication if only a 6-character prefix of the hash is used.
  - Possible solution: use long_url + timestamp as the hash input; if a conflict happens, try again.
- The hashing algorithm itself can be slow.
Pros
- No need to write an additional hash function; easy to implement.
- Keys are randomly distributed.
- Keys are not guessable or reversible.
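A minimal sketch of this approach, assuming MD5 with a 7-character prefix and a timestamp salt on collision; the `exists` callback is a hypothetical stand-in for a lookup against the key store:

```python
import hashlib
import time

def shorten_with_md5(long_url: str, key_length: int = 7, exists=lambda key: False) -> str:
    """Hash the URL with MD5 and keep a short prefix. On collision, salt the
    input with the current timestamp and retry. `exists` is a hypothetical
    stand-in for a lookup against the key store."""
    candidate = hashlib.md5(long_url.encode()).hexdigest()[:key_length]
    while exists(candidate):
        salted = f"{long_url}{time.time_ns()}"
        candidate = hashlib.md5(salted.encode()).hexdigest()[:key_length]
    return candidate

print(shorten_with_md5("https://example.com/some/very/long/path"))
```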
Fast hash functions (Murmur, Jenkins)
Pros
- Faster than crypto hash functions such as MD5 and SHA-256.
Cons
- Key duplication if only a 6-character prefix of the hash is used.
  - Possible solution: use long_url + timestamp as the hash input; if a conflict happens, try again.
- Easier to predict and guess, because these functions are not cryptographically secure.
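The fast-hash path looks much the same; a sketch assuming the third-party mmh3 MurmurHash3 bindings (not part of the standard library):

```python
import mmh3  # third-party MurmurHash3 bindings (`pip install mmh3`) -- an assumption here

def shorten_with_murmur(long_url: str, key_length: int = 7) -> str:
    # 128-bit Murmur hash rendered as hex and truncated, mirroring the
    # prefix-truncation idea above; collisions still need a retry path.
    digest = mmh3.hash128(long_url)
    return format(digest, "032x")[:key_length]
```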
Modified base64 encoding
- In standard base64, the characters at index 62 and 63 are '+' and '/'.
- Inside a URL, '+' and '/' must be percent-encoded as '%2B' and '%2F'.
- The modified (URL-safe) base64 encoding therefore uses '-' instead of '+' and '_' instead of '/'.
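Python's standard library already provides this URL-safe variant, which serves as a quick illustration of the '-'/'_' substitution:

```python
import base64

def to_url_safe_key(raw: bytes) -> str:
    # Standard base64 would emit '+' and '/', which become %2B and %2F in a URL.
    # The URL-safe variant substitutes '-' and '_', so no percent-encoding is needed.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

print(to_url_safe_key(b"\xfb\xef\xff"))  # prints "--__" where standard base64 would print "++//"
```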
Instead of hashing, the short key can also be derived from a self-incrementing id; for example, the id could be generated by a GLOBAL_ID inside the database.
Cons: predictable if not implemented carefully.
Sample implementation for an id "123":
- Pad the id with leading '0's: "123" becomes "0000123".
- Reverse it: "0000123" becomes "3210000".
- Encode "3210000" using base62, as sketched below.
When keys are generated on the fly, collisions after truncating the hash create performance bottlenecks. If the hash values can be calculated ahead of time, the generated short URLs can be stored in advance (for example, in the file system) and simply handed out at request time.
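One way to sketch the ahead-of-time idea is an offline job that pre-generates a pool of unique keys and persists it; the random-key generation and the file-based store here are illustrative assumptions, not the only option:

```python
import secrets

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def pregenerate_key_pool(count: int, key_length: int = 7, out_path: str = "key_pool.txt") -> None:
    """Offline job: build a pool of unique short keys ahead of time and persist
    it, so the request path only pops a ready-made key instead of resolving
    collisions online. The file-based store is an illustrative choice."""
    pool = set()
    while len(pool) < count:
        pool.add("".join(secrets.choice(ALPHABET) for _ in range(key_length)))
    with open(out_path, "w") as out:
        out.write("\n".join(sorted(pool)))

pregenerate_key_pool(1000)
```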