🐝
Mess around software system design
  • README
  • ArchitectureTradeOffAnalysis
    • Estimation
    • Middleware
    • Network
    • Server
    • Storage
  • Conversion cheat sheet
  • Scenarios
    • TinyURL
      • Estimation
      • Flowchart
      • Shortening mechanisms
      • Rest API
      • Performance
      • Storage
      • Follow-up
    • TaskScheduler
      • JDK delay queue
      • Timer based
      • RabbitMQ based
      • Kafka-based fixed delay time
      • Redis-based customized delay time
      • MySQL-based customized delay time
      • Timer TimingWheel
      • Industrial Scheduler
      • Workflow Engine
      • Airflow Arch
    • GoogleDrive
      • Estimation
      • Flowchart
      • Storage
      • Follow-up
    • Youtube
      • Estimation
      • Flowchart
      • Performance
      • Storage
      • Follow-up
      • Netflix
    • Uber
      • Estimation
      • Rest api
      • Flowchart
      • KNN algorithms
      • Geohash-based KNN mechanism
      • Redis implementation
      • Storage
    • Twitter
      • Estimation
      • Flowchart
      • Storage
      • Scalability
      • Follow-up
    • Instant messenger
      • Architecture overview
      • Presence
      • Unread count
      • Notifications
      • Read receipt
      • Large group chat
      • Storage-Offline 1:1 Chat
      • Storage-Offline group chat
      • Storage-Message roaming
      • NonFunc-Realtime
      • NonFunc-Reliability
      • NonFunc-Ordering
      • NonFunc-Security
      • Livecast-LinkedIn
    • Distributed Lock
      • Single machine
      • AP model based
      • CP model based
      • Chubby-TODO
    • Payment system
      • Resilience
      • Consistency
      • Flash sale
    • Key value store
      • Master-slave KV
      • Peer-to-peer KV
      • Distributed cache
  • Time series scenarios
    • Observability
      • TimeSeries data
      • Distributed traces
      • Logs
      • Metrics
      • NonFunc requirments
  • Search engine
    • Typeahead
    • Search engine
    • Distributed crawler
      • Estimation
      • Flowchart
      • Efficiency
      • Robustness
      • Performance
      • Storage
      • Standalone implementation
      • Python Scrapy framework
    • Stream search
  • Big data
    • GFS/HDFS
      • Data flow
      • High availability
      • Consistency
    • Map reduce
    • Big table/Hbase
    • Haystack
    • TopK
    • Stateful stream
    • Lambda architecture
    • storm架构
    • Beam架构
    • Comparing stream frameworks
    • Instagram-[TODO]
  • MicroSvcs
    • Service Registry
      • Flowchart
      • Data model
      • High availability
      • Comparison
      • Implementation
    • Service governance
      • Load balancing
      • Circuit breaker
      • Bulkhead
      • Downgrade
      • Timeout
      • API gateway
      • RateLimiter
        • Config
        • Algorithm comparison
        • Sliding window
        • Industrial impl
    • MicroSvcs_ConfigCenter-[TODO]
    • MicroSvcs_Security
      • Authentication
      • Authorization
      • Privacy
  • Cache
    • Typical topics
      • Expiration algorithm
      • Access patterns
      • Cache penetration
      • Big key
      • Hot key
      • Distributed lock
      • Data consistency
      • High availability
    • Cache_Redis
      • Data structure
      • ACID
      • Performance
      • Availability
      • Cluster
      • Applications
    • Cache_Memcached
  • Message queue
    • Overview
    • Kafka
      • Ordering
      • At least once
      • Message backlog
      • Consumer idempotency
      • High performance
      • Internal leader election
    • MySQL-based msg queue
    • Other msg queues
      • ActiveMQ-TODO
      • RabbitMQ-TODO
      • RocketMQ-TODO
      • Comparison between MQ
  • Traditional DB
    • Index data structure
    • Index categories
    • Lock
    • MVCC
    • Redo & Undo logs
    • Binlog
    • Schema design
    • DB optimization
    • Distributed transactions
    • High availability
    • Scalability
    • DB migration
    • Partition
    • Sharding
      • Sharding strategies
      • Sharding ID generator overview
        • Auto-increment key
        • UUID
        • Snowflake
        • Implement example
      • Cross-shard pagination queries
      • Non-shard key queries
      • Capacity planning
  • Non-Traditional DB
    • NoSQL overview
    • Rum guess
    • Data structure
    • MySQL based key value
    • KeyValueStore
    • ObjectStore
    • ElasticSearch
    • TableStore-[TODO]
    • Time series DB
    • DistributedAcidDatabase-[TODO]
  • Java basics
    • IO
    • Exception handling
  • Java concurrency
    • Overview
      • Synchronized
      • Reentrant lock
      • Concurrent collections
      • CAS
      • Others
    • Codes
      • ThreadLocal
      • ThreadPool
      • ThreadLifeCycle
      • SingletonPattern
      • Future
      • BlockingQueue
      • Counter
      • ConcurrentHashmap
      • DelayedQueue
  • Java JVM
    • Overview
    • Dynamic proxy
    • Class loading
    • Garbage collection
    • Visibility
  • Server
    • Nginx-[TODO]
  • Distributed system theories
    • Elementary school with CAP
    • Consistency
      • Eventual with Gossip
      • Strong with Raft
      • Tunable with Quorum
      • Fault tolerant with BFT-TODO
      • AutoMerge with CRDT
    • Time in distributed system
      • Logical time
      • Physical time
    • DDIA_Studying-[TODO]
  • Protocols
    • ApiDesign
      • REST
      • RPC
    • Websockets
    • Serialization
      • Thrift
      • Avro
    • HTTP
    • HTTPS
    • Netty-TODO
  • Statistical data structure
    • BloomFilter
    • HyperLoglog
    • CountMinSketch
  • DevOps
    • Container_Docker
    • Container_Kubernetes-[TODO]
  • Network components
    • CDN
    • DNS
    • Load balancer
    • Reverse proxy
    • 云中网络-TODO
  • Templates
    • interviewRecord
  • TODO
    • RecommendationSystem-[TODO]
    • SessionServer-[TODO]
    • Disk
    • Unix philosophy and Kafka
    • Bitcoin
    • Design pattern
      • StateMachine
      • Factory
    • Akka
    • GoogleDoc
      • CRDT
Powered by GitBook
On this page
  • Cluster and unclustered index
  • Clustered index structure
  • Unclustered / Secondary index structure
  • Capacity estimation
  • Unique index
  • Covering index
  • Example
  • Composite index
  • Def
  • Why not multiple secondary index
  • Left-prefix index rule
  • Hash index for == and IN
  • Adaptive hash index
  • Queries not using index
  • References

Was this helpful?

  1. Traditional DB

Index categories

PreviousIndex data structureNextLock

Last updated 1 year ago

Was this helpful?

Cluster and unclustered index

Clustered index

Unclustered index

Number

Since a clustered index impacts the physical organization of the data, there can be only one clustered index per table.

Since an unclustered index only points to data location, at least two operations need to be performed for accessing data.

Example

mySQL innoDB primary key index is a clustered index (both index and data inside *.idb file)

mySQL myISAM index is an unclustered index (index inside .myi file and data inside \.myd file)

Perf

For read queries: Clustered index will typically perform a bit faster because only needs to read disk once (data and index stored together)

For write updates/deletes: Unclustered index will typically perform a bit faster because for unclustered index approach, the data part could be written in an append-only fashion and index part could be inserted.

Range query

Primary/Clustered index based range queries are very efficient. There might be a possibility that the disk block that the database has read from the disk contains all the data belonging to the query, since the primary index is clustered & records are ordered physically. So the locality of data can be provided by the primary index

Since the primary index contains a direct reference to the data block address through the virtual address space & disk blocks are physically organized in the order of the index key, every time the OS does some disk page split due to DML operations like INSERT / UPDATE / DELETE, the primary index also needs to be updated. So DML operations puts some pressure on the performance of the primary index.

Clustered index structure

  • If leaf nodes store actual data, it is a clustered index.

Unclustered / Secondary index structure

  • Primary index points to data and secondary index points to primary key. Primary index will be a clustered index and secondary index will be an unclustered index.

  • Secondary index only points to primary key because

    • Save space: Avoid storing copies of data.

    • Keep data consistency: When there are updates on primary key, all other secondary indexes need to be updated at the same time.

Capacity estimation

Clustered index capacity - 5M

  • First/Root layer (Store indexes only):

    • For non-leaf node, suppose that the primary key is an integer (8 Byte / 64 bits) and the address pointer to next level is also 8 bytes / 64 bits.

    • The MySQL InnoDB database engine has block size of 16 KB. It means every time you read or write data to the database, a block of disk pages of size 16 KB will be fetched from the disk into RAM, it will get manipulated and then written back to disk again if required.

    • The first layer has in total 16 KB / 16 Byte = 1K children

  • Second layer (Store indexes only):

    • 1K node with 1K * 1K = 1M children

  • Third layer (Store indexes and record):

    • For leaf node, suppose that record size is 1KB.

    • Each node could store 16KB / 1KB = 16 records.

    • In total, there could be

      • 1M * 16 = 16M records stored in an InnoDB table.

      • Store 1,048,576 * 16 = 16,777,216

    • In practice, each InnoDB usage not bigger than 5 million

Unclustered index capacity - 1G

  • The first two layers will be the same as above.

  • Unclustered index approach could store more data because all three layers of tree are indexes.

    • 1024 * 1024 * 1024 = 1G records

Unique index

  • Like primary keys, unique keys can also identify records uniquely with one difference — the unique key column can contain null values.

  • Unlike other database servers, in MySQL a unique key column can have as many null values as possible. In SQL standard, null means an undefined value. So if MySQL has to contain only one null value in a unique key column, it has to assume that all null values are the same.

CREATE UNIQUE INDEX unique_idx_1 ON index_demo (pan_no);

Covering index

  • A covering index is a special kind of composite index where all the columns specified in the query somewhere exist in the index.

Example

  • The columns mentioned in the SELECT & WHERE clauses are part of the composite index. So in this case, we can actually get the value of the age column from the composite index itself. Let’s see what the EXPLAIN command shows for this query:

-- For a composite index on (pan_no, name, age)  and the following query
SELECT age FROM index_demo WHERE pan_no = 'HJKXS9086W' AND name = 'kousik'

Composite index

Def

  • Multiple column builds a single index. MySQL lets you define indices on multiple columns, up to 16 columns. This index is called a Multi-column / Composite / Compound index. If certain fields are appearing together regularly in queries, please consider creating a composite index.

  • Let’s say we have an index defined on 4 columns — col1, col2, col3, col4. With a composite index, we have search capability on col1, (col1, col2) , (col1, col2, col3) , (col1, col2, col3, col4). So we can use any left side prefix of the indexed columns, but we can’t omit a column from the middle & use that like — (col1, col3) or (col1, col2, col4) or col3 or col4 etc. These are invalid combinations.

Why not multiple secondary index

  • MySQL uses only one index per table per query except for UNION. (In a UNION, each logical query is run separately, and the results are merged.) So defining multiple indices on multiple columns does not guarantee those indices will be used even if they are part of the query.

Left-prefix index rule

  1. Left-to-right, no skipping: MySQL can only access the index in order, starting from the leftmost column and moving to the right. It can't skip columns in the index.

  2. Stops at the first range: MySQL stops using the index after the first range condition encountered.

Hash index for == and IN

  • B+ tree index

    • Use case: Used in 99.99% case because it supports different types of queries

  • Hash index

    • Use case: Only applicable for == and IN type of query, does not support range/order by query, could not be used in left-prefix composite index

Adaptive hash index

Queries not using index

  • Used SQL query including !=, LIKE

  • Had special expressions such as math and function call.

  • When dataset size is too small.

References

  • https://medium.com/free-code-camp/database-indexing-at-a-glance-bb50809d48bd

Cluster and unclustered index
Clustered index structure
Unclustered / Secondary index structure
Capacity estimation
Clustered index capacity - 5M
Unclustered index capacity - 1G
Unique index
Covering index
Example
Composite index
Def
Why not multiple secondary index
Left-prefix index rule
Hash index for == and IN
Adaptive hash index
Queries not using index
References
Clustered index
Index B tree secondary index
Index B tree secondary index