Count-min sketch

Motivation

  • Suppose we want to find the best sellers on Amazon. A single-machine data structure such as a hash table + heap would not work because:

    • Not space-efficient: Amazon's products follow a long-tail distribution, and such a huge hash table could not fit in a single machine's memory.

      • Suppose we use:

        • Each item requires a 32-bit unique id as the key and a 32-bit integer as the value.

        • There are 350 million products on Amazon.

      • In total that is 350 * 10^6 * 64 bit ≈ 2.8 GB of raw memory, and hash table overhead (buckets, pointers, load factor) multiplies that several times over.

    • High latency: even if we persist sections of this huge hash table to disk, it won't be performant due to disk write latency.

      • Suppose we use:

        • An average write latency of 10 ms per item.

        • There are 350 million products on Amazon.

      • Writing every count once takes 10 ms * 350 * 10^6 = 3.5 * 10^6 s / 3600 / 24 ≈ 40 days.

Approximate algorithm

V1: 1-dimensional array with 1 hash function

  • Pros:

    • Space-efficient: eliminates the need to store hash table keys, and each value is a fixed-width counter.

  • Cons: hash collisions inflate counts, so a query returns only an approximate upper bound on the true frequency (see the sketch below).
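
As a concrete illustration, a minimal sketch of V1 in Python. The array width and the use of MD5 as the hash function are assumptions for the example, and V1Counter is a hypothetical name:

import hashlib

class V1Counter:
    # V1: a single array of fixed-width counters and one hash function.
    # Keys are never stored; colliding items share a counter, so the
    # estimate can only overcount, never undercount.

    def __init__(self, width=1024):
        self.width = width
        self.counts = [0] * width

    def _index(self, item):
        digest = hashlib.md5(str(item).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item):
        self.counts[self._index(item)] += 1

    def estimate(self, item):
        return self.counts[self._index(item)]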

V2: 1-dimensional array with d hash functions

  • Pros: taking the minimum of the d counters reduces the deviation from the actual count.

  • Cons: adding more hash functions to a single shared array actually increases the chance of collisions, since all d hash functions write into the same counters (see the sketch below).
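
A minimal sketch of V2 under the same assumptions (illustrative width and depth; MD5 salted with a seed stands in for d independent hash functions). Note that all d hashes still land in one shared array:

import hashlib

class V2Counter:
    # V2: one shared array, but d hash functions per item. A query
    # takes the min over the d counters, yet each update also writes
    # d counters into the same array, so it fills up d times faster.

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.counts = [0] * width

    def _indexes(self, item):
        for seed in range(self.depth):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for i in self._indexes(item):
            self.counts[i] += 1

    def estimate(self, item):
        return min(self.counts[i] for i in self._indexes(item))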

V3: 2-dimensional array (d rows × w columns) with d hash functions

  • Pros: each hash function gets its own row of counters, so the d hash functions don't interfere with each other.

  • Cons:

    • Still an approximate algorithm that yields an upper-bound estimate: with width w = ceil(e/ε) and depth d = ceil(ln(1/δ)), the estimated count exceeds the true count by at most ε * N with probability at least 1 − δ, where N is the total number of increments (see the sketch below).
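
Putting V3 together, a minimal Count-min sketch in Python (salted MD5 again stands in for d independent hash functions; a production version would derive width and depth from the ε/δ bound above):

import hashlib

class CountMinSketch:
    # V3: a depth x width matrix of counters. Row i is touched only by
    # hash function i, so each row gives an independent overestimate
    # and the min over rows is the tightest one.

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))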

Applications

Database query planning

  • Database engines plan how they execute queries. How quickly a query is performed can heavily depend on the execution strategy, so it is a crucial area of optimization. For example, this is especially important when determining the order in which several joins are performed, a task known as join order optimization.

  • Part of finding good execution strategies is estimating the table sizes yielded by certain subqueries. For example, given a join, such as the one below, we want to find out how many rows the result will have.

  • This information can then be used to allocate a sufficient amount of space. More importantly, in a bigger query where the result is joined with a table c, it could be used to determine which tables to join first.

  • To estimate the size of the join, we can create two CM sketches with the same width, depth, and hash functions: one holds the frequencies of the values x in a, the other the frequencies of the values x in b. Since the true join size is the sum over x of freq_a(x) * freq_b(x), querying the two sketches together estimates how many rows the result will have (see the sketch after the query below).

  • Building up full hash tables for this task would require a huge amount of space. Using a sketch data structure is much more feasible, especially since the SQL tables in the join could potentially be very big. Furthermore, an approximate result is generally good enough for planning.

SELECT *
FROM a, b
WHERE a.x = b.x
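
As an illustration, a join-size estimate from two sketches, reusing the hypothetical CountMinSketch class sketched above (both sketches must share width, depth, and hash functions):

def estimate_join_size(sketch_a, sketch_b):
    # Each row's inner product sums count_a(x) * count_b(x) over all
    # buckets; collisions only add terms, so every row overestimates
    # the true join size and we take the minimum over rows.
    assert sketch_a.width == sketch_b.width
    assert sketch_a.depth == sketch_b.depth
    return min(
        sum(ca * cb for ca, cb in zip(row_a, row_b))
        for row_a, row_b in zip(sketch_a.table, sketch_b.table)
    )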

Heavy-hitters (topK)

  • A common task in many analytics applications is finding heavy hitters: elements that appear many times. For example, given a huge log of website visits, we might want to determine the most popular pages and how often they were visited. Again, building up a full hash table could scale badly if there is a long tail of unpopular pages with few visits.

  • To solve the problem with CMS, we simply iterate through the log once and build up the sketch [2]. To query the sketch, we need to come up with candidate keys to check. If we do not have an existing candidate set, we can simply go through the log again, look up each page in the CMS, and keep the top K seen so far (see the sketch below).
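
A minimal two-pass version of this approach, again reusing the hypothetical CountMinSketch class from above (pages is assumed to be a list of visited page paths):

import heapq

def top_k_pages(pages, k=10):
    # Pass 1: build the sketch from the log.
    cms = CountMinSketch(width=4096, depth=4)
    for page in pages:
        cms.add(page)
    # Pass 2: treat every distinct page as a candidate and rank by
    # estimated count. CMS never undercounts, so no true heavy hitter
    # is dropped, though false positives are possible.
    return heapq.nlargest(k, set(pages), key=cms.estimate)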

References
