Understanding Bloom Filters In Python

Introduction

Bloom filters are a probabilistic data structure that can be used to test whether an element is a member of a set. They excel in situations where the amount of data is so large that not all of it can fit in memory.

How Does A Bloom Filter Work?

A Bloom filter consists of a bit array of a fixed size. In the beginning, all the bits in the array are set to 0. When an element comes in, it is hashed by multiple hash functions into different positions in the bit array, and these positions are set to 1.

Python Code Implementation

Below is a simple Python implementation of a Bloom filter using the built-in hashlib module for hash functions and bitarray to manage the bit array.

The Python code is shown below:

import hashlib
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, hash_num):
        self.size = size
        self.hash_num = hash_num
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)
    
    def add(self, string):
        for seed in range(self.hash_num):
            result = hashlib.sha256(string.encode('utf-8')).hexdigest()
            digest = int(result, 16)
            bit = digest % self.size
            self.bit_array[bit] = 1
    
    def lookup(self, string):
        for seed in range(self.hash_num):
            result = hashlib.sha256(string.encode('utf-8')).hexdigest()
            digest = int(result, 16)
            bit = digest % self.size
            if self.bit_array[bit] == 0:
              return "Nope"
        return "Probably"

In the above code, the BloomFilter class is initialized with a specified size and number of hash functions (hash_num). The add method hashes an incoming string and sets the corresponding bit in the bitarray to 1. The lookup method checks if the bits at the hashed positions are set; if they are, it's likely the element is in the set.

Use of Bloom Filter

A Python example demonstrating the usage of the above BloomFilter class is demonstrated below:

bloom = BloomFilter(500000, 7)
elements = ["dog", "cat", "bird"]

for ele in elements:
	bloom.add(ele)

for ele in elements:
	print(bloom.lookup(ele))

print(bloom.lookup("dinosaur"))

In the above code, we initialize a BloomFilter with size 500000 and 7 hash functions. We then add the strings "dog", "cat", and "bird" to the filter. When we lookup these elements, the result is "Probably", indicating that they're likely part of the set. However, when we look up "dinosaur", which was not added to the filter, the result is "Nope", indicating it's almost certainly not part of the set.

Conclusion

In this blog post, we covered the basics of Bloom Filters, how they work and how to implement one in Python. Bloom filters are an incredibly useful tool for dealing with large datasets efficiently.

Next time you're faced with a situation where you have a huge list of elements and want to quickly check if an item maybe in that list without using lots of memory, consider using a Bloom Filter!

That's the end of this blog post, happy coding!