🐝
Mess around software system design
  • README
  • ArchitectureTradeOffAnalysis
    • Estimation
    • Middleware
    • Network
    • Server
    • Storage
  • Conversion cheat sheet
  • Scenarios
    • TinyURL
      • Estimation
      • Flowchart
      • Shortening mechanisms
      • Rest API
      • Performance
      • Storage
      • Follow-up
    • TaskScheduler
      • JDK delay queue
      • Timer based
      • RabbitMQ based
      • Kafka-based fixed delay time
      • Redis-based customized delay time
      • MySQL-based customized delay time
      • Timer TimingWheel
      • Industrial Scheduler
      • Workflow Engine
      • Airflow Arch
    • GoogleDrive
      • Estimation
      • Flowchart
      • Storage
      • Follow-up
    • Youtube
      • Estimation
      • Flowchart
      • Performance
      • Storage
      • Follow-up
      • Netflix
    • Uber
      • Estimation
      • Rest api
      • Flowchart
      • KNN algorithms
      • Geohash-based KNN mechanism
      • Redis implementation
      • Storage
    • Twitter
      • Estimation
      • Flowchart
      • Storage
      • Scalability
      • Follow-up
    • Instant messenger
      • Architecture overview
      • Presence
      • Unread count
      • Notifications
      • Read receipt
      • Large group chat
      • Storage-Offline 1:1 Chat
      • Storage-Offline group chat
      • Storage-Message roaming
      • NonFunc-Realtime
      • NonFunc-Reliability
      • NonFunc-Ordering
      • NonFunc-Security
      • Livecast-LinkedIn
    • Distributed Lock
      • Single machine
      • AP model based
      • CP model based
      • Chubby-TODO
    • Payment system
      • Resilience
      • Consistency
      • Flash sale
    • Key value store
      • Master-slave KV
      • Peer-to-peer KV
      • Distributed cache
  • Time series scenarios
    • Observability
      • TimeSeries data
      • Distributed traces
      • Logs
      • Metrics
      • NonFunc requirments
  • Search engine
    • Typeahead
    • Search engine
    • Distributed crawler
      • Estimation
      • Flowchart
      • Efficiency
      • Robustness
      • Performance
      • Storage
      • Standalone implementation
      • Python Scrapy framework
    • Stream search
  • Big data
    • GFS/HDFS
      • Data flow
      • High availability
      • Consistency
    • Map reduce
    • Big table/Hbase
    • Haystack
    • TopK
    • Stateful stream
    • Lambda architecture
    • storm架构
    • Beam架构
    • Comparing stream frameworks
    • Instagram-[TODO]
  • MicroSvcs
    • Service Registry
      • Flowchart
      • Data model
      • High availability
      • Comparison
      • Implementation
    • Service governance
      • Load balancing
      • Circuit breaker
      • Bulkhead
      • Downgrade
      • Timeout
      • API gateway
      • RateLimiter
        • Config
        • Algorithm comparison
        • Sliding window
        • Industrial impl
    • MicroSvcs_ConfigCenter-[TODO]
    • MicroSvcs_Security
      • Authentication
      • Authorization
      • Privacy
  • Cache
    • Typical topics
      • Expiration algorithm
      • Access patterns
      • Cache penetration
      • Big key
      • Hot key
      • Distributed lock
      • Data consistency
      • High availability
    • Cache_Redis
      • Data structure
      • ACID
      • Performance
      • Availability
      • Cluster
      • Applications
    • Cache_Memcached
  • Message queue
    • Overview
    • Kafka
      • Ordering
      • At least once
      • Message backlog
      • Consumer idempotency
      • High performance
      • Internal leader election
    • MySQL-based msg queue
    • Other msg queues
      • ActiveMQ-TODO
      • RabbitMQ-TODO
      • RocketMQ-TODO
      • Comparison between MQ
  • Traditional DB
    • Index data structure
    • Index categories
    • Lock
    • MVCC
    • Redo & Undo logs
    • Binlog
    • Schema design
    • DB optimization
    • Distributed transactions
    • High availability
    • Scalability
    • DB migration
    • Partition
    • Sharding
      • Sharding strategies
      • Sharding ID generator overview
        • Auto-increment key
        • UUID
        • Snowflake
        • Implement example
      • Cross-shard pagination queries
      • Non-shard key queries
      • Capacity planning
  • Non-Traditional DB
    • NoSQL overview
    • Rum guess
    • Data structure
    • MySQL based key value
    • KeyValueStore
    • ObjectStore
    • ElasticSearch
    • TableStore-[TODO]
    • Time series DB
    • DistributedAcidDatabase-[TODO]
  • Java basics
    • IO
    • Exception handling
  • Java concurrency
    • Overview
      • Synchronized
      • Reentrant lock
      • Concurrent collections
      • CAS
      • Others
    • Codes
      • ThreadLocal
      • ThreadPool
      • ThreadLifeCycle
      • SingletonPattern
      • Future
      • BlockingQueue
      • Counter
      • ConcurrentHashmap
      • DelayedQueue
  • Java JVM
    • Overview
    • Dynamic proxy
    • Class loading
    • Garbage collection
    • Visibility
  • Server
    • Nginx-[TODO]
  • Distributed system theories
    • Elementary school with CAP
    • Consistency
      • Eventual with Gossip
      • Strong with Raft
      • Tunable with Quorum
      • Fault tolerant with BFT-TODO
      • AutoMerge with CRDT
    • Time in distributed system
      • Logical time
      • Physical time
    • DDIA_Studying-[TODO]
  • Protocols
    • ApiDesign
      • REST
      • RPC
    • Websockets
    • Serialization
      • Thrift
      • Avro
    • HTTP
    • HTTPS
    • Netty-TODO
  • Statistical data structure
    • BloomFilter
    • HyperLoglog
    • CountMinSketch
  • DevOps
    • Container_Docker
    • Container_Kubernetes-[TODO]
  • Network components
    • CDN
    • DNS
    • Load balancer
    • Reverse proxy
    • 云中网络-TODO
  • Templates
    • interviewRecord
  • TODO
    • RecommendationSystem-[TODO]
    • SessionServer-[TODO]
    • Disk
    • Unix philosophy and Kafka
    • Bitcoin
    • Design pattern
      • StateMachine
      • Factory
    • Akka
    • GoogleDoc
      • CRDT
Powered by GitBook
On this page
  • Optimization mindset
  • Server/Engine optimization
  • InnoDB buffer improvement
  • SQL optimization
  • Slow query
  • Choose index columns
  • Always define a primary key for each table
  • Use auto-increment int column when possible
  • Optimize ORDER BY
  • Optimize COUNT
  • FORCE INDEX, USE INDEX and IGNORE INDEX
  • Replace WHERE with HAVING
  • Optimize LIMIT XXX OFFSET YYY
  • Optimize IN
  • Join
  • IN operator
  • Unequal filter when possible
  • Filtering based on Nullable match conditions
  • Prefix based fuzzy matching
  • Functions on index
  • Computation expression on index
  • Or condition

Was this helpful?

  1. Traditional DB

DB optimization

PreviousSchema designNextDistributed transactions

Last updated 1 year ago

Was this helpful?

Optimization mindset

  • Optimize from the following four levels:

Server/Engine optimization

  • Fine-tune database software such as

    • Transaction isolation levels

    • InnoDB disk flush frequency

InnoDB buffer improvement

  • InnoDB tries to minimise disk I/O operation by using a buffer. Following is the representation:

  • InnoDB buffer inserts, deletes, and updates if the needed leaf node is not in memory. The buffer is flushed when it is full or the corresponding leaf nodes come into memory. This way InnoDB defers the disk I/O operation. But still, database write operation can be made much much faster by leveraging the available disk bandwidth which existing relational databases fail to do. Also relational database systems are very complex inside as they use locking, concurrency, ACID transaction semantics etc which makes read write operation more complex.

SQL optimization

  • Goal:

    • Reduce disk IO by avoiding full table scanning, use index when possible and use covered index.

    • Reduce CPU/memory consumption by reducing sorting, grouping, and deduplication operations.

Slow query

  • In most cases, please use EXPLAIN to understand the execution plan before optimizing. But there are some patterns practices which are known to have bad performance.

Choose index columns

  • On columns not changing often

  • On columns which have high cardinality

  • On columns whose sizes are smaller. If the column's size is big, could consider build index on its prefix.

  • Create indexes on columns frequently on "Where" conditions to avoid full-table scanning.

  • Create indexes on "Order By" to avoid sorting again when display results.

-- create index on prefix of a column
CREAT INDEX on index_name ON table(col_name(n))

Always define a primary key for each table

  1. When PRIMARY KEY is defined, InnoDB uses primary key index as the clustered index.

  2. When PRIMARY KEY is not defined, InnoDB will use the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.

  3. When PRIMRARY KEY is not defined and there is no logical unique and non-null column or set of columns, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX on a synthetic column containing ROWID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The ROWID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.

Use auto-increment int column when possible

  • Why prefer auto-increment over random (e.g. UUID)?

    • In most cases, primary index uses B+ tree index.

    • For B+ tree index, if a new record has an auto-increment primary key, then it could be directly appended in the leaf node layer. Otherwise, B+ tree node split and rebalance would need to be performed.

  • Why int versus other types (string, composite primary key)?

    • Smaller footprint: Primary key will be stored within each B tree index node, making indexes sparser. Things like composite index or string based primary key will result in less index data being stored in every node.

Optimize ORDER BY

  • Add index on sorting columns

  • The underlying reason is that indexes are ordered by themselves.

SELECT * FROM xxx WHERE uid = 123 ORDER BY update_time
  • After adding indexes on columns such as uid and update_time, the query time will be optimized because indexes are sorted by themselves.

Optimize COUNT

  • SELECT COUNT(*) is a popular query. However, MySQL InnoDB engine doesn't store the total number of rows.

Use approximate numbers instead

  • Before executing actual SQl queries, use EXPLAIN command to estimate the number of records

SELECT COUNT(*) FROM xxx WHERE uid = 123

---Use Explain query
EXPLAIN SELECT COUNT(*) FROM xxx WHERE uid = 123

Use NoSQL DB to store number of records

  • How to keep the consistency between DB and noSQL

    • If short-term inconsistency is acceptable for the business domain, then asynchronously update could be adopted.

    • Use tools like Canal to watch MySQL binlog, and update the count on Redis.

FORCE INDEX, USE INDEX and IGNORE INDEX

  • In practice, when SQL queries are using the wrong index, we could use this to force index usages.

  • This should only be used in worst case scenarios.

Replace WHERE with HAVING

  • If not using aggregate functions as filter condition, we'd better write filter condition inside WHERE.

Optimize LIMIT XXX OFFSET YYY

  • Use small LIMIT values XXX. Could replace with where id > max_id.

-- Original query
select * from myshop.ecs_order_info order by myshop.ecs_order_info.order_id limit 4000000, 100

-- Optimization option 1 if order_id is continuous, 
select * from myshop.ecs_order_info order where myshop.ecs_order_info.order_id between 4000000 and 4000100

-- Optimization option 2 if order_id is not continuous,
-- Compared the original query, the child query "select order_id ..." uses covering index and will be faster.
select * from myshop.ecs_order_info where 
(myshop.ecs_order_info.order_id >= (select order_id from myshop.ecs_order_info order by order_id limit 4000000,1) limit 100)
-- Original query
select * from myshop.ecs_users u where u.last_login_time >= 1590076800 order by u.last_login_time, u.user_id limit 200000, 10

-- Optimization with join query
select * from myshop.ecs_users u (select user_id from myshop.ecs_users where u.last_login_time >= 1590076800) u1 where u1.user_id = u.user_id order by u.user_id

Optimize IN

  • Use IN for low radix attributes if leftmost prefix index could not be used

-- using dating website as an example
-- 1. Composite index: city, sex, age
select * from users_table where city == XX and sex == YY and age <= ZZ

-- 2. There will be cases where some users don't filter based on sex
select * from users_table where city == XX and age <= ZZ

-- 3. Could use IN to make WHERE clause satisfy leftmost prefix condition
select * from users_table where city == XX and Sex in ('male', 'female') and age <= ZZ

Join

  • When joining two tables, assume table A has num1 returned according to the JOIN condition, table B has num2 returned according to the JOIN condition. And Assume num1 > num2.

  • Make sure:

    • Query against table B (smaller) will be executed first.

    • filters on table A (bigger) will be based on indexed column.

  • Avoid using more than three joins in any case.

    • For join, handle that inside application code when the join is big. Business applications are easier to scale.

  • Two algorithms:

    • Block nested join

    • Nested loop join

IN operator

  • When there are too few or many operators inside IN, it might not go through index.

Unequal filter when possible

  • Don't use "IS NOT NULL" or "IS NULL": Index (binary tree) could not be created on Null values.

  • Don't use != : Index could not be used. Could use < and > combined together.

    • Select name from abc where id != 20

    • Optimized version: Select name from abc where id > 20 or id < 20

Filtering based on Nullable match conditions

  • There are only two values for a null filter (is null or is not null). In most cases it will do a whole table scanning.

Prefix based fuzzy matching

  • Use % in the beginning will cause the database for a whole table scanning.

SELECT name from abc where name like %xyz

Functions on index

  • Don't use function or expression on index column

-- Original query:
select ... from product
where to_days(out_date) - to_days(current_date) <= 30

-- Improved query:
select ... from product
where out_date <= date_add(current_date, interval 30 day)

Computation expression on index

SELECT comment_id, user_id, comment_text FROM product_comment WHERE comment_id+1 = 900001

Or condition

  • If only one condition inside OR has index.

SELECT comment_id, user_id, comment_text FROM product_comment WHERE comment_id = 900001 OR comment_text = '462eed7ac6e791292a79'

Optimization mindset
Server/Engine optimization
InnoDB buffer improvement
SQL optimization
Slow query
Choose index columns
Always define a primary key for each table
Use auto-increment int column when possible
Optimize ORDER BY
Optimize COUNT
Use approximate numbers instead
Use NoSQL DB to store number of records
FORCE INDEX, USE INDEX and IGNORE INDEX
Replace WHERE with HAVING
Optimize LIMIT XXX OFFSET YYY
Optimize IN
Join
IN operator
Unequal filter when possible
Filtering based on Nullable match conditions
Prefix based fuzzy matching
Functions on index
Computation expression on index
Or condition
https://coding.imooc.com/lesson/49.html#mid=513
https://study.163.com/course/courseLearn.htm?courseId=1209773843#/learn/video?lessonId=1280437152&courseId=1209773843
https://coding.imooc.com/lesson/49.html#mid=439