Capacity planning

High concurrent writes and large volume across tables: MySQL DB Sharding

Use cases

Last resort

  • Sharding should be used as a last resort after you exhausted the following:

    • Add cache

    • Add read-write separation

    • Consider table partition

Triggers

  • Only use in OLTP cases (OLAP is more likely to have complex changing SQL queries)

  • A single table's capacity reaches 2GB.

  • A database should not contain more than 1,000 tables.

  • Each individual table should not exceed 1 GB in size or 20 million rows;

  • The total size of all the tables in a database should not exceed 2 GB.

Capacity planning

  • For fast growing data (e.g. order data in ecommerce website), use 2X planned capacity to avoid resharding

  • For slow growing data (e.g. user identity data in ecommerce website), use 3-year estimated capacity to avoid resharding.

Introduced problems

How to choose sharding key

  • Avoid cross shard joins

  • Make data distribution even across shards

Choose the number of shards

  • If has a cap on storage:

    • Each shard could contain at most 1TB data.

    • number of shards = total storage / 1TB

  • If has a cap on number of records:

    • Suppose the size of row is 100 bytes

      • User table: uid (long 8 bytes), name (fixed char 16 bytes), city (int 4 bytes), timestamp (long 8 bytes), sex (int 4 bytes), age (int 4 bytes) = total 40 bytes

    • Total size of the rows: 100 bytes * Number_of_records

    • number of shards = total size of rows / 1TB

Limited SQL queries

Sharding stratgies

Lookup strategy

  • Pros:

    • Easy to migrate data

  • Cons:

    • Need an additional hop when query

    • If the lookup table is really big, it could also need to be sharded

Range strategy

  • Pros:

    • Easy to add a new shard. No need to move the original data. For example, each month could have a new shard.

  • Cons:

    • Uneven distribution. For example, July is the hot season but December is the cold season.

By customer or tenant

  • If it is a SaaS business, it is often true that data from one customer doesn't interact with data from any of your other customers. These apps are usually called multi-tenant apps.

    • Multi-tenant apps usually require strong consistency where transaction is in place and data loss is not possible.

    • Multi-tenant data usually evolves over time to provide more and more functionality. Unlike consumer apps which benefit from network effects to grow, B2B applications grows by adding new features for customers.

By geography

  • Apps such as postmate, lyft or instacart.

  • You’re not going to live in Alabama and order grocery delivery from California. And if you were to order a Lyft pick-up from California to Alabama you’ll be waiting a good little while for your pickup.

By time

  • Time sharding is incredibly common when looking at some form of event data. Event data may include clicks/impressions of ads, it could be network event data, or data from a systems monitoring perspective.

  • This approach should be used when

    • You generate your reporting/alerts by doing analysis on the data with time as one axis.

    • You’re regularly rolling off data so that you have a limited retention of it.

Hash strategy

By entity id

  • Shard based on hashing value of a field.

  • Pros:

    • Evenly distributed data

  • Cons:

    • Hard to add a new shard. Lots of data migration need to happen.

How to store unsharded table

  • Store a copy within each shard

    • Cons: Write across to guarantee consistency

  • Use a centralized node to store

    • Cons: Need to have cross shard joins

How to deploy shards on nodes

  • Each shard gets stored in a separate database, and each database gets stored in a separate node.

  • Each shard gets stored in a separate database, and multiple database gets stored in a single node.

Cross shard join

  • Query types:

    • Join queries:

    • count queries:

    • order by queries:

  • Solutions:

    • Aggregate query result for different shard within application code.

    • Usually use two sets of data to solve the problem

      • One data is based on unique sharding key.

      • The other one is data replicated asynchronously to Elasticsearch or Solr.

Distributed transactions (write across shards)

  • Original transaction needs to be conducted within a distributed transaction.

    • e.g. ecommerce example (order table and inventory table)

  • There are wwo ways in general to implement distributed transactions:

    • 2PC

    • TCC

  • For example, some software has built-in implementations such as:

    • MySQL XA

    • Spring JTA

Unique global ID

Challenges in Graph DB sharding

  • Graph model is most common in B2C apps like Facebook and Instagram.

  • With this model, data is often replicated in a few different forms. Then it is the responsibility of the application to map to the form that is most useful to acquire the data. The result is you have multiple copies for your data sharded in different ways, eventual consistency of data typically, and then have some application logic you have to map to your sharding strategy. For apps like Facebook and Reddit there is little choice but to take this approach, but it does come at some price.

ShardingSphere

  • ShardingSphere has three solutions: ShardingJDBC / ShardingProxy / ShardingSphere

Sharding JDBC

Sharding Proxy**

Sharding example (In Chinese)

  • Original table

  • Sharded result

    • Vertical sharding: Store, product and Sku should be stored in three different databases.

    • Fast growing table: Among all three database, Sku table will grow much faster than product and store.

    • Binding table: Sku and SkuInfo always appear together. Product and productType usually appear together. They should be sharded according to the same column.

Last updated