🐝
Mess around software system design
  • README
  • ArchitectureTradeOffAnalysis
    • Estimation
    • Middleware
    • Network
    • Server
    • Storage
  • Conversion cheat sheet
  • Scenarios
    • TinyURL
      • Estimation
      • Flowchart
      • Shortening mechanisms
      • Rest API
      • Performance
      • Storage
      • Follow-up
    • TaskScheduler
      • JDK delay queue
      • Timer based
      • RabbitMQ based
      • Kafka-based fixed delay time
      • Redis-based customized delay time
      • MySQL-based customized delay time
      • Timer TimingWheel
      • Industrial Scheduler
      • Workflow Engine
      • Airflow Arch
    • GoogleDrive
      • Estimation
      • Flowchart
      • Storage
      • Follow-up
    • Youtube
      • Estimation
      • Flowchart
      • Performance
      • Storage
      • Follow-up
      • Netflix
    • Uber
      • Estimation
      • Rest api
      • Flowchart
      • KNN algorithms
      • Geohash-based KNN mechanism
      • Redis implementation
      • Storage
    • Twitter
      • Estimation
      • Flowchart
      • Storage
      • Scalability
      • Follow-up
    • Instant messenger
      • Architecture overview
      • Presence
      • Unread count
      • Notifications
      • Read receipt
      • Large group chat
      • Storage-Offline 1:1 Chat
      • Storage-Offline group chat
      • Storage-Message roaming
      • NonFunc-Realtime
      • NonFunc-Reliability
      • NonFunc-Ordering
      • NonFunc-Security
      • Livecast-LinkedIn
    • Distributed Lock
      • Single machine
      • AP model based
      • CP model based
      • Chubby-TODO
    • Payment system
      • Resilience
      • Consistency
      • Flash sale
    • Key value store
      • Master-slave KV
      • Peer-to-peer KV
      • Distributed cache
  • Time series scenarios
    • Observability
      • TimeSeries data
      • Distributed traces
      • Logs
      • Metrics
      • NonFunc requirments
  • Search engine
    • Typeahead
    • Search engine
    • Distributed crawler
      • Estimation
      • Flowchart
      • Efficiency
      • Robustness
      • Performance
      • Storage
      • Standalone implementation
      • Python Scrapy framework
    • Stream search
  • Big data
    • GFS/HDFS
      • Data flow
      • High availability
      • Consistency
    • Map reduce
    • Big table/Hbase
    • Haystack
    • TopK
    • Stateful stream
    • Lambda architecture
    • storm架构
    • Beam架构
    • Comparing stream frameworks
    • Instagram-[TODO]
  • MicroSvcs
    • Service Registry
      • Flowchart
      • Data model
      • High availability
      • Comparison
      • Implementation
    • Service governance
      • Load balancing
      • Circuit breaker
      • Bulkhead
      • Downgrade
      • Timeout
      • API gateway
      • RateLimiter
        • Config
        • Algorithm comparison
        • Sliding window
        • Industrial impl
    • MicroSvcs_ConfigCenter-[TODO]
    • MicroSvcs_Security
      • Authentication
      • Authorization
      • Privacy
  • Cache
    • Typical topics
      • Expiration algorithm
      • Access patterns
      • Cache penetration
      • Big key
      • Hot key
      • Distributed lock
      • Data consistency
      • High availability
    • Cache_Redis
      • Data structure
      • ACID
      • Performance
      • Availability
      • Cluster
      • Applications
    • Cache_Memcached
  • Message queue
    • Overview
    • Kafka
      • Ordering
      • At least once
      • Message backlog
      • Consumer idempotency
      • High performance
      • Internal leader election
    • MySQL-based msg queue
    • Other msg queues
      • ActiveMQ-TODO
      • RabbitMQ-TODO
      • RocketMQ-TODO
      • Comparison between MQ
  • Traditional DB
    • Index data structure
    • Index categories
    • Lock
    • MVCC
    • Redo & Undo logs
    • Binlog
    • Schema design
    • DB optimization
    • Distributed transactions
    • High availability
    • Scalability
    • DB migration
    • Partition
    • Sharding
      • Sharding strategies
      • Sharding ID generator overview
        • Auto-increment key
        • UUID
        • Snowflake
        • Implement example
      • Cross-shard pagination queries
      • Non-shard key queries
      • Capacity planning
  • Non-Traditional DB
    • NoSQL overview
    • Rum guess
    • Data structure
    • MySQL based key value
    • KeyValueStore
    • ObjectStore
    • ElasticSearch
    • TableStore-[TODO]
    • Time series DB
    • DistributedAcidDatabase-[TODO]
  • Java basics
    • IO
    • Exception handling
  • Java concurrency
    • Overview
      • Synchronized
      • Reentrant lock
      • Concurrent collections
      • CAS
      • Others
    • Codes
      • ThreadLocal
      • ThreadPool
      • ThreadLifeCycle
      • SingletonPattern
      • Future
      • BlockingQueue
      • Counter
      • ConcurrentHashmap
      • DelayedQueue
  • Java JVM
    • Overview
    • Dynamic proxy
    • Class loading
    • Garbage collection
    • Visibility
  • Server
    • Nginx-[TODO]
  • Distributed system theories
    • Elementary school with CAP
    • Consistency
      • Eventual with Gossip
      • Strong with Raft
      • Tunable with Quorum
      • Fault tolerant with BFT-TODO
      • AutoMerge with CRDT
    • Time in distributed system
      • Logical time
      • Physical time
    • DDIA_Studying-[TODO]
  • Protocols
    • ApiDesign
      • REST
      • RPC
    • Websockets
    • Serialization
      • Thrift
      • Avro
    • HTTP
    • HTTPS
    • Netty-TODO
  • Statistical data structure
    • BloomFilter
    • HyperLoglog
    • CountMinSketch
  • DevOps
    • Container_Docker
    • Container_Kubernetes-[TODO]
  • Network components
    • CDN
    • DNS
    • Load balancer
    • Reverse proxy
    • 云中网络-TODO
  • Templates
    • interviewRecord
  • TODO
    • RecommendationSystem-[TODO]
    • SessionServer-[TODO]
    • Disk
    • Unix philosophy and Kafka
    • Bitcoin
    • Design pattern
      • StateMachine
      • Factory
    • Akka
    • GoogleDoc
      • CRDT
Powered by GitBook
On this page
  • Use case
  • Pros
  • Cons
  • Real life applications
  • Cassandra usage
  • Content Recommendation System
  • One-hit-wonders
  • Financial Fraud Detection
  • References
  • Todo

Was this helpful?

  1. Statistical data structure

BloomFilter

PreviousNetty-TODONextHyperLoglog

Last updated 1 year ago

Was this helpful?

Use case

  • An empty bloom filter is merely a Bit Vector storing bits with all set to value as 0.

  • To add an element to the Bloom filter, we simply hash it by using our hash function and set the bits in the bit vector at the index of those hashes to 1.

  • To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If the bits are 1 then the element is probably present, but if zero then the word is definitely not present.

Pros

  • Space efficiency: Bloom filter does not store the actual items. In this way it’s space efficient. It’s just an array of integers.

  • Saves expensive data scanning across several servers depending on the use case.

Cons

  • Removal Of Item: If you remove an item or more specifically clear set bits corresponding to some item from bloom filter, it might erase some other data as well because many set bits might be shared among multiple items. So remember if you use bloom filter, removal is not an option for you. What’s inserted in bloom filter, stays in bloom filter.

  • Inserted items are not retrievable: Bloom filters does not track items, it just sets different positions in the array. So inserted items can not be checked out.

  • False Positive result: The size of the bloom filter has to be known beforehand. Otherwise with a small size, the array will saturate and the amount of false positive result will increase. In case of saturation, you can try designing some strategy to reset the bloom filter or use a scalable bloom filter. If you keep adding more & more elements to the bloom filter, the probability of false positive increases.

  • Insertion & Search Cost: Every item goes through k hash functions. So performance during insertion or search operation depends on the efficiency of the chosen hash functions as well. Inserting element & searching element takes O(k) time as you run k hash functions and just set or check k number of indices in the array.

Real life applications

Cassandra usage

  • Apache Cassandra uses SSTable data structure on disk to save rows. So at a millions of scale, there will be thousands of SSTable files on disk. Even tens of thousands of read requests per unit time will cause very expensive disk IO operations to search for concerned row(s) in all the SSTables one by one in some order when data is not found in the in-memory tables of Cassandra. So bloom filters are used to approximately identify if some row / column with the given data or id exists in a SSTable. If the result is ‘May Be Present’, Cassandra searches in the corresponding SSTable, in case the row is not found, Cassandra continues search operation in other SSTables. So using bloom filter, Cassandra saves a lot of unnecessary SSTable scan hence saving huge disk IO operations cost.

Content Recommendation System

  • Imagine some site recommending you some articles, news, videos etc which you might not have seen earlier. So there are probably thousands of stuffs which can be recommended and you might have seen tens or hundreds of recommendation already. So in order to skip the recommendations that are already served to you, bloom filters are used. Medium uses bloom filter to avoid showing duplicate recommendations.

One-hit-wonders

  • Akamai & Facebook uses bloom filters to avoid caching the items that are very rarely searched or searched only once. Only when they are searched more than once, they will get cached. Several strategies might be designed to avoid such situations.

Financial Fraud Detection

  • If you have a credit card, your credit card company knows about your spending history — the vendors that you have transacted with previously, the category of vendors, the cities where the card was used etc. So when you make a new transaction, in the background some rules can execute which will decide whether the vendor or city or any parameter is already seen or suspicious. Bloom filters can be used to design such strategy.

References

  • https://medium.datadriveninvestor.com/bloom-filter-a-simple-but-interesting-data-structure-37fd53b11606

Todo

When bloomfilters don't bloom