🐝
Mess around software system design
  • README
  • ArchitectureTradeOffAnalysis
    • Estimation
    • Middleware
    • Network
    • Server
    • Storage
  • Conversion cheat sheet
  • Scenarios
    • TinyURL
      • Estimation
      • Flowchart
      • Shortening mechanisms
      • Rest API
      • Performance
      • Storage
      • Follow-up
    • TaskScheduler
      • JDK delay queue
      • Timer based
      • RabbitMQ based
      • Kafka-based fixed delay time
      • Redis-based customized delay time
      • MySQL-based customized delay time
      • Timer TimingWheel
      • Industrial Scheduler
      • Workflow Engine
      • Airflow Arch
    • GoogleDrive
      • Estimation
      • Flowchart
      • Storage
      • Follow-up
    • Youtube
      • Estimation
      • Flowchart
      • Performance
      • Storage
      • Follow-up
      • Netflix
    • Uber
      • Estimation
      • Rest api
      • Flowchart
      • KNN algorithms
      • Geohash-based KNN mechanism
      • Redis implementation
      • Storage
    • Twitter
      • Estimation
      • Flowchart
      • Storage
      • Scalability
      • Follow-up
    • Instant messenger
      • Architecture overview
      • Presence
      • Unread count
      • Notifications
      • Read receipt
      • Large group chat
      • Storage-Offline 1:1 Chat
      • Storage-Offline group chat
      • Storage-Message roaming
      • NonFunc-Realtime
      • NonFunc-Reliability
      • NonFunc-Ordering
      • NonFunc-Security
      • Livecast-LinkedIn
    • Distributed Lock
      • Single machine
      • AP model based
      • CP model based
      • Chubby-TODO
    • Payment system
      • Resilience
      • Consistency
      • Flash sale
    • Key value store
      • Master-slave KV
      • Peer-to-peer KV
      • Distributed cache
  • Time series scenarios
    • Observability
      • TimeSeries data
      • Distributed traces
      • Logs
      • Metrics
      • NonFunc requirments
  • Search engine
    • Typeahead
    • Search engine
    • Distributed crawler
      • Estimation
      • Flowchart
      • Efficiency
      • Robustness
      • Performance
      • Storage
      • Standalone implementation
      • Python Scrapy framework
    • Stream search
  • Big data
    • GFS/HDFS
      • Data flow
      • High availability
      • Consistency
    • Map reduce
    • Big table/Hbase
    • Haystack
    • TopK
    • Stateful stream
    • Lambda architecture
    • storm架构
    • Beam架构
    • Comparing stream frameworks
    • Instagram-[TODO]
  • MicroSvcs
    • Service Registry
      • Flowchart
      • Data model
      • High availability
      • Comparison
      • Implementation
    • Service governance
      • Load balancing
      • Circuit breaker
      • Bulkhead
      • Downgrade
      • Timeout
      • API gateway
      • RateLimiter
        • Config
        • Algorithm comparison
        • Sliding window
        • Industrial impl
    • MicroSvcs_ConfigCenter-[TODO]
    • MicroSvcs_Security
      • Authentication
      • Authorization
      • Privacy
  • Cache
    • Typical topics
      • Expiration algorithm
      • Access patterns
      • Cache penetration
      • Big key
      • Hot key
      • Distributed lock
      • Data consistency
      • High availability
    • Cache_Redis
      • Data structure
      • ACID
      • Performance
      • Availability
      • Cluster
      • Applications
    • Cache_Memcached
  • Message queue
    • Overview
    • Kafka
      • Ordering
      • At least once
      • Message backlog
      • Consumer idempotency
      • High performance
      • Internal leader election
    • MySQL-based msg queue
    • Other msg queues
      • ActiveMQ-TODO
      • RabbitMQ-TODO
      • RocketMQ-TODO
      • Comparison between MQ
  • Traditional DB
    • Index data structure
    • Index categories
    • Lock
    • MVCC
    • Redo & Undo logs
    • Binlog
    • Schema design
    • DB optimization
    • Distributed transactions
    • High availability
    • Scalability
    • DB migration
    • Partition
    • Sharding
      • Sharding strategies
      • Sharding ID generator overview
        • Auto-increment key
        • UUID
        • Snowflake
        • Implement example
      • Cross-shard pagination queries
      • Non-shard key queries
      • Capacity planning
  • Non-Traditional DB
    • NoSQL overview
    • Rum guess
    • Data structure
    • MySQL based key value
    • KeyValueStore
    • ObjectStore
    • ElasticSearch
    • TableStore-[TODO]
    • Time series DB
    • DistributedAcidDatabase-[TODO]
  • Java basics
    • IO
    • Exception handling
  • Java concurrency
    • Overview
      • Synchronized
      • Reentrant lock
      • Concurrent collections
      • CAS
      • Others
    • Codes
      • ThreadLocal
      • ThreadPool
      • ThreadLifeCycle
      • SingletonPattern
      • Future
      • BlockingQueue
      • Counter
      • ConcurrentHashmap
      • DelayedQueue
  • Java JVM
    • Overview
    • Dynamic proxy
    • Class loading
    • Garbage collection
    • Visibility
  • Server
    • Nginx-[TODO]
  • Distributed system theories
    • Elementary school with CAP
    • Consistency
      • Eventual with Gossip
      • Strong with Raft
      • Tunable with Quorum
      • Fault tolerant with BFT-TODO
      • AutoMerge with CRDT
    • Time in distributed system
      • Logical time
      • Physical time
    • DDIA_Studying-[TODO]
  • Protocols
    • ApiDesign
      • REST
      • RPC
    • Websockets
    • Serialization
      • Thrift
      • Avro
    • HTTP
    • HTTPS
    • Netty-TODO
  • Statistical data structure
    • BloomFilter
    • HyperLoglog
    • CountMinSketch
  • DevOps
    • Container_Docker
    • Container_Kubernetes-[TODO]
  • Network components
    • CDN
    • DNS
    • Load balancer
    • Reverse proxy
    • 云中网络-TODO
  • Templates
    • interviewRecord
  • TODO
    • RecommendationSystem-[TODO]
    • SessionServer-[TODO]
    • Disk
    • Unix philosophy and Kafka
    • Bitcoin
    • Design pattern
      • StateMachine
      • Factory
    • Akka
    • GoogleDoc
      • CRDT
Powered by GitBook
On this page
  • Scenario
  • Storage
  • How to save a file in one machine
  • How to save a much larger file in one machine
  • Scale
  • Architecture style
  • How to save an extra large file on several machines
  • Write process
  • Do not support modification
  • Read process
  • Master task
  • Failure and recovery

Was this helpful?

  1. Non-Traditional DB

ObjectStore

PreviousKeyValueStoreNextElasticSearch

Last updated 3 years ago

Was this helpful?

Scenario

  • Write a file

  • Read a file

  • Use multiple machines to store these files

Storage

How to save a file in one machine

  • Metadata

    • FileInfo

      • Name = dengchao.mp4

      • CreatedTime = 201505031232

      • Size = 2044323

    • Index

      • Block 11 -> diskOffset1

      • Block 12 -> diskOffset2

      • Block 13 -> diskOffset3

  • Block

    • 1 block = 1024 Byte

    • Advantages

      • Error checking

      • Fragmenting the data for storage

How to save a much larger file in one machine

  • Change chunk size

    • 1 chunk = 64M = 64 * 1024K

    • Advantages

      • Reduce size of metadata

    • Disadvantages

      • Waste space for small files

Scale

Architecture style

  • Peer 2 Peer (BitComet, Cassandra)

    • Advantage: No single point of failure

    • Disadvantage: Multiple machines need to negotiate with each other

  • Master slave

    • Advantage: Simple design. Easy to keep data consistent

    • Disadvantage: Master is a single point of failure

  • Final decision

    • Master + slave

    • Restart the single master

How to save an extra large file on several machines

  • One master + many chunk servers

Move chunk offset from master to slaves

  • Master don't record the disk offset of a chunk

    • Advantage: Reduce the size of metadata in master; Reduce the traffic between master and chunk server

Write process

  1. The client divides the file into chunks. Create a chunk index for each chunk

  2. Send (FileName, chunk index) to master and master replies with assigned chunk servers

  3. The client transfer data with the assigned chunk server.

Do not support modification

Read process

  1. The client sends (FileName) to master and receives a chunk list (chunk index, chunk server) from the master

  2. The client connects with different server for reading files

Master task

  • Store metadata for different files

  • Store Map (file name + chunk index -> chunk server)

    • Find corresponding server when reading in data

    • Write to more available chunk server

Failure and recovery

Single master

  • Double master (Apache Hadoop Goes Realtime at Facebook)

  • Multi master (Paxos algorithm)

What if a chunk is broken

  • Check sum 4bytes = 32 bit

  • Each chunk has a checksum

  • Write checksum when writing out a chunk

  • Check checsum when reading in a chunk

Avoid loss of data when chunk server is down

  • Replica: 3 copies

    • Two copies in the same data center but on different racks

    • Third copy in a different data center

  • How to choose chunk servers

    • Find servers which are not busy

    • Find servers with lots of available disk space

How to recover when a chunk is broken

  • Ask master for help

How to find whether a chunk server is down

  • Heart beat message

How to solve client bottleneck

  • Client only writes to a leader chunk server. The leader chunk server is responsible for communicating with other chunk servers.

  • How to select leading slaves

How to solve chunk server failure

  • Ask the client to retry

File system
Scenario
Storage
How to save a file in one machine
How to save a much larger file in one machine
Scale
Architecture style
How to save an extra large file on several machines
Move chunk offset from master to slaves
Write process
Do not support modification
Read process
Master task
Failure and recovery
Single master
What if a chunk is broken
Avoid loss of data when chunk server is down
How to recover when a chunk is broken
How to find whether a chunk server is down
How to solve client bottleneck
How to solve chunk server failure