Follow-up

Problem

  • What if disallowing same long urls mapped to different short urls (similar allow user to define customized urls)

Avoid duplication

  • Option1: Before insert random url into database. Check whether it exists within database, if not then insert. And build index on shortUrl for faster search. However, index building could be an expensive operation.

  • Option2: Use bloomfilter to check whether a url already exists within the dictionary.

  • Option3: Client could cache the short url after the first request.

  • Option4: Put longUrl => shortUrl map in the cache for several days. This could make sure that the same short url will be returned during this time.

Sharding solutions

  • Additional considerations: Not only short-to-long mapping needs to be stored. The long-to-short mapping needs to be stored as well.

  • This will create additional complexity when choosing sharding key

Naive sharding key

  • Use Long Url as sharding key

    • Short to long operation will require lots of cross-shard joins

  • Use Short Url as sharding key

    • Short to long url: Find database according to short url; Find long url in the corresponding database

    • Long to short url: Broadcast to N databases to see whether the link exist before. If not, get the next ID and insert into database.

Combine short and long Url as sharding key

Idea

  • Problem: For "31bJF4", how do we find which physical machine the code locate at?

  • Our query is

    • Select original_url from tiny_url where short_url = XXXXX;

  • The simple one

    • We query all nodes. But there will be n queries

    • This will be pretty expensive, even if we fan out database calls.

    • But the overall QPS to database will be n times QPS

  • Solution

    • Add virtual node number to prefix "31bJF4" => "131bJF4"

    • When querying, we can find virtual nodes from prefix (the tinyUrl length is 6). Then we can directly query node1.

// Assign virtual nodes to physical nodes
// Store within database
{
    0: "db0.tiny_url.com",
    1: "db1.tiny_url.com",
    2: "db2.tiny_url.com",
    3: "db3.tiny_url.com"
}

Implementation

  • Hash(longUrl)%62 + shortkey

  • Given shortURL, we can get the sharding machine by the first bit of shortened url.

  • Given longURL, get the sharding machine according to Hash(longURL) % 62. Then take the first bit.

Geo location sharding key

  • Sharding according to the geographical info.

    • First know which websites are more popular in which region. Put all websites popular in US in US DB.

Last updated