🐝
Mess around software system design
  • README
  • ArchitectureTradeOffAnalysis
    • Estimation
    • Middleware
    • Network
    • Server
    • Storage
  • Conversion cheat sheet
  • Scenarios
    • TinyURL
      • Estimation
      • Flowchart
      • Shortening mechanisms
      • Rest API
      • Performance
      • Storage
      • Follow-up
    • TaskScheduler
      • JDK delay queue
      • Timer based
      • RabbitMQ based
      • Kafka-based fixed delay time
      • Redis-based customized delay time
      • MySQL-based customized delay time
      • Timer TimingWheel
      • Industrial Scheduler
      • Workflow Engine
      • Airflow Arch
    • GoogleDrive
      • Estimation
      • Flowchart
      • Storage
      • Follow-up
    • Youtube
      • Estimation
      • Flowchart
      • Performance
      • Storage
      • Follow-up
      • Netflix
    • Uber
      • Estimation
      • Rest api
      • Flowchart
      • KNN algorithms
      • Geohash-based KNN mechanism
      • Redis implementation
      • Storage
    • Twitter
      • Estimation
      • Flowchart
      • Storage
      • Scalability
      • Follow-up
    • Instant messenger
      • Architecture overview
      • Presence
      • Unread count
      • Notifications
      • Read receipt
      • Large group chat
      • Storage-Offline 1:1 Chat
      • Storage-Offline group chat
      • Storage-Message roaming
      • NonFunc-Realtime
      • NonFunc-Reliability
      • NonFunc-Ordering
      • NonFunc-Security
      • Livecast-LinkedIn
    • Distributed Lock
      • Single machine
      • AP model based
      • CP model based
      • Chubby-TODO
    • Payment system
      • Resilience
      • Consistency
      • Flash sale
    • Key value store
      • Master-slave KV
      • Peer-to-peer KV
      • Distributed cache
  • Time series scenarios
    • Observability
      • TimeSeries data
      • Distributed traces
      • Logs
      • Metrics
      • NonFunc requirments
  • Search engine
    • Typeahead
    • Search engine
    • Distributed crawler
      • Estimation
      • Flowchart
      • Efficiency
      • Robustness
      • Performance
      • Storage
      • Standalone implementation
      • Python Scrapy framework
    • Stream search
  • Big data
    • GFS/HDFS
      • Data flow
      • High availability
      • Consistency
    • Map reduce
    • Big table/Hbase
    • Haystack
    • TopK
    • Stateful stream
    • Lambda architecture
    • storm架构
    • Beam架构
    • Comparing stream frameworks
    • Instagram-[TODO]
  • MicroSvcs
    • Service Registry
      • Flowchart
      • Data model
      • High availability
      • Comparison
      • Implementation
    • Service governance
      • Load balancing
      • Circuit breaker
      • Bulkhead
      • Downgrade
      • Timeout
      • API gateway
      • RateLimiter
        • Config
        • Algorithm comparison
        • Sliding window
        • Industrial impl
    • MicroSvcs_ConfigCenter-[TODO]
    • MicroSvcs_Security
      • Authentication
      • Authorization
      • Privacy
  • Cache
    • Typical topics
      • Expiration algorithm
      • Access patterns
      • Cache penetration
      • Big key
      • Hot key
      • Distributed lock
      • Data consistency
      • High availability
    • Cache_Redis
      • Data structure
      • ACID
      • Performance
      • Availability
      • Cluster
      • Applications
    • Cache_Memcached
  • Message queue
    • Overview
    • Kafka
      • Ordering
      • At least once
      • Message backlog
      • Consumer idempotency
      • High performance
      • Internal leader election
    • MySQL-based msg queue
    • Other msg queues
      • ActiveMQ-TODO
      • RabbitMQ-TODO
      • RocketMQ-TODO
      • Comparison between MQ
  • Traditional DB
    • Index data structure
    • Index categories
    • Lock
    • MVCC
    • Redo & Undo logs
    • Binlog
    • Schema design
    • DB optimization
    • Distributed transactions
    • High availability
    • Scalability
    • DB migration
    • Partition
    • Sharding
      • Sharding strategies
      • Sharding ID generator overview
        • Auto-increment key
        • UUID
        • Snowflake
        • Implement example
      • Cross-shard pagination queries
      • Non-shard key queries
      • Capacity planning
  • Non-Traditional DB
    • NoSQL overview
    • Rum guess
    • Data structure
    • MySQL based key value
    • KeyValueStore
    • ObjectStore
    • ElasticSearch
    • TableStore-[TODO]
    • Time series DB
    • DistributedAcidDatabase-[TODO]
  • Java basics
    • IO
    • Exception handling
  • Java concurrency
    • Overview
      • Synchronized
      • Reentrant lock
      • Concurrent collections
      • CAS
      • Others
    • Codes
      • ThreadLocal
      • ThreadPool
      • ThreadLifeCycle
      • SingletonPattern
      • Future
      • BlockingQueue
      • Counter
      • ConcurrentHashmap
      • DelayedQueue
  • Java JVM
    • Overview
    • Dynamic proxy
    • Class loading
    • Garbage collection
    • Visibility
  • Server
    • Nginx-[TODO]
  • Distributed system theories
    • Elementary school with CAP
    • Consistency
      • Eventual with Gossip
      • Strong with Raft
      • Tunable with Quorum
      • Fault tolerant with BFT-TODO
      • AutoMerge with CRDT
    • Time in distributed system
      • Logical time
      • Physical time
    • DDIA_Studying-[TODO]
  • Protocols
    • ApiDesign
      • REST
      • RPC
    • Websockets
    • Serialization
      • Thrift
      • Avro
    • HTTP
    • HTTPS
    • Netty-TODO
  • Statistical data structure
    • BloomFilter
    • HyperLoglog
    • CountMinSketch
  • DevOps
    • Container_Docker
    • Container_Kubernetes-[TODO]
  • Network components
    • CDN
    • DNS
    • Load balancer
    • Reverse proxy
    • 云中网络-TODO
  • Templates
    • interviewRecord
  • TODO
    • RecommendationSystem-[TODO]
    • SessionServer-[TODO]
    • Disk
    • Unix philosophy and Kafka
    • Bitcoin
    • Design pattern
      • StateMachine
      • Factory
    • Akka
    • GoogleDoc
      • CRDT
Powered by GitBook
On this page
  • How to choose sharding key
  • Sharding strategies
  • Lookup strategy
  • Range strategy
  • Hash strategy
  • Typical sharding keys
  • By customer or tenant id
  • By geography
  • By time
  • Sharding middleware formats
  • SDK format
  • Proxy format
  • Cons

Was this helpful?

  1. Traditional DB
  2. Sharding

Sharding strategies

PreviousShardingNextSharding ID generator overview

Last updated 1 year ago

Was this helpful?

How to choose sharding key

  • Avoid cross shard joins

  • Make data distribution even across shards

Sharding strategies

Lookup strategy

  • Pros:

    • Easy to migrate data

  • Cons:

    • Need an additional hop when query

    • If the lookup table is really big, it could also need to be sharded

Range strategy

  • Pros:

    • Easy to add a new shard. No need to move the original data. For example, each month could have a new shard.

  • Cons:

    • Uneven distribution. For example, July is the hot season but December is the cold season.

Hash strategy

  • Shard based on hashing value of a field.

  • Pros:

    • Evenly distributed data

  • Cons:

    • Hard to add a new shard. Lots of data migration need to happen.

Typical sharding keys

By customer or tenant id

  • If it is a SaaS business, it is often true that data from one customer doesn't interact with data from any of your other customers. These apps are usually called multi-tenant apps.

    • Multi-tenant apps usually require strong consistency where transaction is in place and data loss is not possible.

    • Multi-tenant data usually evolves over time to provide more and more functionality. Unlike consumer apps which benefit from network effects to grow, B2B applications grows by adding new features for customers.

By geography

  • Apps such as postmate, lyft or instacart.

  • You’re not going to live in Alabama and order grocery delivery from California. And if you were to order a Lyft pick-up from California to Alabama you’ll be waiting a good little while for your pickup.

By time

  • Time sharding is incredibly common when looking at some form of event data. Event data may include clicks/impressions of ads, it could be network event data, or data from a systems monitoring perspective.

  • This approach should be used when

    • You generate your reporting/alerts by doing analysis on the data with time as one axis.

    • You’re regularly rolling off data so that you have a limited retention of it.

Sharding middleware formats

SDK format

  • It lives as dependency injection format within the code.

Proxy format

  • It works like a independent database.

Cons

  • Low performance when compared with SDK formats. Since all data query is sent to it, it will easily become the bottleneck.

How to choose sharding key
Sharding strategies
Lookup strategy
Range strategy
Hash strategy
Typical sharding keys
By customer or tenant id
By geography
By time
Sharding middleware formats
SDK format
Proxy format
Cons
lookup
range
hash