There are over 60 variants of Bloom filters, many surveys of the field, and a continuing churn of applications (see e.g., Luo et al.). Some of the variants differ sufficiently from the original proposal to be breaches from or forks of the original data structure and its philosophy. A treatment which unifies Bloom filters with other work on random projections, compressive sensing, and locality sensitive hashing remains to be done (though see Dasgupta et al. for one attempt inspired by neuroscience).
Cache filtering Content delivery networks deploy
web caches around the world to cache and serve web content to users with greater performance and reliability. A key application of Bloom filters is their use in efficiently determining which web objects to store in these web caches. Nearly three-quarters of the URLs accessed from a typical web cache are "one-hit-wonders" that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request. The use of a Bloom filter in this fashion significantly reduces the disk write workload, since most one-hit-wonders are not written to the disk cache. Further, filtering out the one-hit-wonders also saves cache space on disk, increasing the cache hit rates.
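A minimal Python sketch of this policy follows; the filter parameters and the fetch_from_origin helper are illustrative assumptions, not part of any particular CDN's implementation.

<syntaxhighlight lang="python">
from hashlib import blake2b

class BloomFilter:
    """Minimal Bloom filter used by the caching sketch below."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

def handle_request(url, seen_once, cache):
    """Store an object only on its second request, so one-hit-wonders
    never reach the disk cache."""
    if url in cache:
        return cache[url]               # cache hit
    body = fetch_from_origin(url)       # hypothetical origin fetch
    if url in seen_once:                # seen before: worth caching now
        cache[url] = body
    else:
        seen_once.add(url)              # first sighting: remember only
    return body
</syntaxhighlight>

A production cache would also need to age or rotate the filter as it fills; the sketch omits that detail.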
Avoiding false positives in a finite universe Kiss
et al. described a new construction for the Bloom filter that avoids false positives in addition to the typical non-existence of false negatives. The construction applies to a finite universe from which set elements are taken. It relies on an existing non-adaptive combinatorial
group testing scheme by Eppstein, Goodrich and Hirschberg. Unlike the typical Bloom filter, elements are hashed to a bit array through deterministic, fast and simple-to-calculate functions. The maximal set size for which false positives are completely avoided is a function of the universe size and is controlled by the amount of allocated memory. Alternatively, an initial Bloom filter can be constructed in the standard way and then, since the domain is finite and can be enumerated tractably, all false positives can be found exhaustively and a second Bloom filter constructed from that list; false positives in the second filter are handled in the same way by constructing a third, and so on (see the sketch below). As the universe is finite and the set of false positives strictly shrinks with each step, this procedure results in a finite
cascade of Bloom filters that (on this closed, finite domain) will produce only true positives and true negatives. To check for membership in the filter cascade, the initial filter is queried, and, if the result is positive, the second filter is then consulted, and so on. This construction is used in
CRLite, a proposed
certificate revocation status distribution mechanism for the
Web PKI, and
Certificate Transparency is exploited to close the set of extant certificates.
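The cascade construction lends itself to a short sketch. The following toy Python code builds a cascade over a finite universe and answers exact membership queries; the filter sizes, per-level hash salting, and parity-based lookup are illustrative choices, not CRLite's actual parameters.

<syntaxhighlight lang="python">
def make_filter(items, m, k, level):
    """Plain Bloom filter as a bit list; the level acts as a hash salt so
    each stage of the cascade uses fresh hash functions."""
    bits = [0] * m
    for x in items:
        for i in range(k):
            bits[hash((level, i, x)) % m] = 1
    return bits

def in_filter(bits, x, k, level):
    return all(bits[hash((level, i, x)) % len(bits)] for i in range(k))

def build_cascade(members, universe, m=512, k=3):
    """Level 0 encodes the set; each further level encodes the false
    positives of the previous one. On a finite universe the false
    positive set shrinks until the cascade terminates."""
    cascade, ins, outs = [], set(members), set(universe) - set(members)
    level = 0
    while ins:
        f = make_filter(ins, m, k, level)
        cascade.append(f)
        # False positives of this level become the next level's members.
        ins, outs = {x for x in outs if in_filter(f, x, k, level)}, ins
        level += 1
    return cascade

def cascade_contains(cascade, x, k=3):
    """x is a member iff it passes an odd number of consecutive levels."""
    passed = 0
    for level, f in enumerate(cascade):
        if not in_filter(f, x, k, level):
            break
        passed += 1
    return passed % 2 == 1
</syntaxhighlight>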
Counting Bloom filters Counting filters provide a way to implement a
delete operation on a Bloom filter without recreating the filter afresh. In a counting filter, the array positions (buckets) are extended from being a single bit to being a multibit counter. In fact, regular Bloom filters can be considered as counting filters with a bucket size of one bit. Counting filters were introduced by Fan et al. (2000). The insert operation is extended to
increment the value of the buckets, and the lookup operation checks that each of the required buckets is non-zero. The delete operation then consists of decrementing the value of each of the respective buckets.
Arithmetic overflow of the buckets is a problem and the buckets should be sufficiently large to make this case rare. If it does occur then the increment and decrement operations must leave the bucket set to the maximum possible value in order to retain the properties of a Bloom filter. The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more space than static Bloom filters. In contrast, other data structures that allow deletions use less space than a static Bloom filter. Another issue with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must be known in advance. Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted. Bonomi et al. introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure; once the designed capacity is exceeded, the keys can be reinserted in a new hash table of double size. The space-efficient variant by Putze et al. could also be used to implement counting filters by supporting insertions and deletions. Rottenstreich et al. introduced a general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions. Unlike counting Bloom filters, at each element insertion the hashed counters are incremented by a hashed variable increment instead of a unit increment. To query an element, the exact values of the counters are considered, not just whether they are positive. If a sum represented by a counter value cannot be composed from the variable increment corresponding to the queried element, a negative answer can be returned to the query. Kim et al. (2019) showed that the false positive probability of the counting Bloom filter decreases as the number of hash functions k grows from 1 to a point k_{opt} and increases beyond it, and derived k_{opt} as a function of the count threshold.
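As a concrete illustration, a minimal counting filter with saturating 4-bit buckets might look as follows; sizes and hashing are arbitrary choices for the sketch, and deletion assumes the element really was inserted.

<syntaxhighlight lang="python">
class CountingBloomFilter:
    """Counting Bloom filter sketch: 4-bit saturating counters per bucket."""
    MAX = 15  # 4-bit counters; saturated buckets are never decremented

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _pos(self, x):
        return [hash((i, x)) % self.m for i in range(self.k)]

    def add(self, x):
        for p in self._pos(x):
            if self.counts[p] < self.MAX:   # saturate instead of overflowing
                self.counts[p] += 1

    def remove(self, x):
        """Only valid for elements previously added."""
        for p in self._pos(x):
            if self.counts[p] < self.MAX:   # a saturated counter must stay put
                self.counts[p] -= 1

    def __contains__(self, x):
        return all(self.counts[p] > 0 for p in self._pos(x))
</syntaxhighlight>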
Decentralized aggregation Bloom filters can be organized in distributed
data structures to perform fully decentralized computations of
aggregate functions. Decentralized aggregation makes collective measurements locally available in every node of a distributed network without involving a centralized computational entity for this purpose.
Distributed Bloom filters Parallel Bloom filters can be implemented to take advantage of the multiple
processing elements (PEs) present in
parallel shared-nothing machines. One of the main obstacles for a parallel Bloom filter is the organization and communication of the unordered data, which is, in general, distributed evenly over all PEs at initiation or at batch insertions. To order the data two approaches can be used, resulting either in a Bloom filter over all data being stored on each PE, called a replicating Bloom filter, or in the Bloom filter over all data being split into equal parts, with each PE storing one part of it. For both approaches a "Single Shot" Bloom filter is used, which calculates only one hash and therefore sets only one bit per element, to reduce the communication volume.
Distributed Bloom filters are initiated by first hashing all elements on their local PE and then sorting them by their hashes locally. This can be done in linear time using e.g.
Bucket sort and also allows local duplicate detection. The sorting is used to group the hashes by their assigned PE, in order to create a Bloom filter for each group. After encoding these Bloom filters using e.g.
Golomb coding each Bloom filter is sent as a packet to the PE responsible for the hash values that were inserted into it. A PE p is responsible for all hashes between the values p*(s/|\text{PE}|) and (p+1)*(s/|\text{PE}|), where s is the total size of the Bloom filter over all data. Because each element is only hashed once and therefore only a single bit is set, to check if an element was inserted into the Bloom filter only the PE responsible for the hash value of the element needs to be contacted. Single insertion operations can also be done efficiently because only the Bloom filter of one PE has to be changed, compared to replicating Bloom filters where every PE would have to update its Bloom filter. By distributing the global Bloom filter over all PEs instead of storing it separately on each PE, the Bloom filter can be far larger, resulting in a larger capacity and lower false positive rate. Distributed Bloom filters can be used to improve duplicate detection algorithms by filtering out the most "unique" elements. These can be found by communicating only the hashes of elements, not the elements themselves, which are far larger in volume, and removing them from the set, reducing the workload for the duplicate detection algorithm used afterwards. During the communication of the hashes the PEs search for bits that are set in more than one of the received packets, as this would mean that two elements had the same hash and therefore could be duplicates. If this occurs, a message containing the index of the bit, which is also the hash of the element that could be a duplicate, is sent to the PEs which sent a packet with that bit set. If multiple indices are sent to the same PE by one sender, it can be advantageous to encode the indices as well. All elements that did not have their hash sent back are now guaranteed not to be duplicates and will not be evaluated further; for the remaining elements a repartitioning algorithm can be used. First all the elements that had their hash value sent back are sent to the PE that is responsible for their hash. Every element and its possible duplicate are now guaranteed to be on the same PE. In the second step each PE uses a
sequential algorithm for duplicate detection on the received elements, which are only a fraction of the original elements. By allowing a false positive rate for the duplicates, the communication volume can be reduced further, as the PEs do not have to send elements with duplicated hashes at all and can instead simply mark any element with a duplicated hash as a duplicate. As a result, the false positive rate for duplicate detection is the same as the false positive rate of the Bloom filter used. The process of filtering out the most "unique" elements can also be repeated multiple times by changing the hash function in each filtering step. If only a single filtering step is used, it has to achieve a small false positive rate; if the filtering step is repeated once, however, the first step can tolerate a higher false positive rate while the second step achieves a lower one, since it works on far fewer elements, many having already been removed by the earlier filtering step. While using more than two repetitions can reduce the communication volume further if the number of duplicates in a set is small, the payoff for the additional complications is low.
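The key routing idea, that a single hash determines the one PE that must be contacted, can be sketched in a few lines of Python; here the PEs and their shards are simulated in one process, and the sizes are arbitrary.

<syntaxhighlight lang="python">
S, N_PES = 1 << 16, 8                        # global filter size, PE count

# One shard of the global bit array per simulated PE.
shards = [bytearray(S // N_PES // 8) for _ in range(N_PES)]

def owner_and_offset(x):
    """With a 'Single Shot' filter each element has exactly one hash, so
    exactly one PE (the owner of that range of the global bit array) is
    involved in any insert or lookup."""
    h = hash(x) % S
    pe = h * N_PES // S                      # PE p owns [p*S/N, (p+1)*S/N)
    return pe, h - pe * (S // N_PES)

def insert(x):
    pe, off = owner_and_offset(x)
    shards[pe][off // 8] |= 1 << (off % 8)   # only one PE's shard changes

def query(x):
    pe, off = owner_and_offset(x)
    return bool(shards[pe][off // 8] >> (off % 8) & 1)
</syntaxhighlight>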
Replicating Bloom filters organize their data by using a well known
hypercube algorithm for gossiping: first each PE calculates the Bloom filter over all local elements and stores it. By repeating a loop where in each step i the PEs send their local Bloom filter over dimension i and merge the Bloom filter they receive over that dimension with their local Bloom filter, the set of elements each Bloom filter covers doubles in every iteration. After sending and receiving Bloom filters over all \log |\text{PE}| dimensions, each PE contains the global Bloom filter over all elements. Replicating Bloom filters are more efficient when the number of queries is much larger than the number of elements that the Bloom filter contains; the break-even point compared to distributed Bloom filters is approximately after |\text{PE}| * |\text{Elements}| / \log_{f^\text{+}} |\text{PE}| accesses, with f^\text{+} as the false positive rate of the Bloom filter.
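A minimal simulation of the hypercube gossip follows; PEs are modeled as list indices and filters as integer bit masks, whereas a real system would exchange messages over dimension i instead.

<syntaxhighlight lang="python">
def replicate(local_filters):
    """Hypercube all-gather by OR: after log2(P) rounds every PE holds
    the global Bloom filter. P must be a power of two."""
    p = len(local_filters)
    filters = list(local_filters)
    dim = 0
    while (1 << dim) < p:
        # Each PE r merges with its partner across hypercube dimension dim.
        filters = [filters[r] | filters[r ^ (1 << dim)] for r in range(p)]
        dim += 1                       # coverage doubles each round
    return filters

# Example: 4 PEs, each with a few locally set bits.
locals_ = [0b0001, 0b0010, 0b0100, 0b1000]
print(replicate(locals_))              # every PE ends with 0b1111
</syntaxhighlight>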
Data synchronization Bloom filters can be used for approximate
data synchronization. Counting Bloom filters can be used to approximate the number of differences between two sets.
Bloom filters for streaming data Bloom filters can be adapted to the context of streaming data. For instance, Deng and Rafiei proposed Stable Bloom filters, which consist of a counting Bloom filter where the insertion of a new element sets the associated counters to a value c, and then only a fixed amount s of counters are decreased by 1, hence the memory mostly contains information about recent elements (intuitively, one could assume that the lifetime of an element inside a Stable Bloom filter of N counters is around c \tfrac{N}{s}). Another solution is the Aging Bloom filter, which consists of two Bloom filters, each occupying half the total available memory: when one filter is full, the second filter is erased and newer elements are then added to this newly empty filter. However, it has been shown that no matter the filter, after n insertions, the sum of the false positive FP and false negative FN probabilities is bounded below by FP + FN \geq 1 - \frac{1 - \left(1 - \frac {1}{L} \right)^m}{1 - \left( 1 - \frac {1}{L} \right)^{n}} where L is the number of all possible elements (the alphabet size) and m the memory size (in bits), assuming n > m. This result shows that for L big enough and n going to infinity, the lower bound converges to FP + FN = 1, which is the characteristic relation of a random filter. Hence, after enough insertions, and if the alphabet is too big to be stored in memory (which is assumed in the context of probabilistic filters), it is impossible for a filter to perform better than randomness. This result can be leveraged by only expecting a filter to operate on a sliding window rather than the whole stream. In this case, the exponent n in the formula above is replaced by the window size w, which gives a lower bound that may deviate from 1 if the memory size m is not too small relative to w.
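A Stable Bloom filter is short to sketch. The parameters below are illustrative; with these definitions an element's expected lifetime is roughly Max·N/s insertions.

<syntaxhighlight lang="python">
import random

class StableBloomFilter:
    """Stable Bloom filter sketch: each insertion first decrements s
    random counters, then sets the k counters hashed from the element
    to Max, so old elements fade out as new ones arrive."""
    def __init__(self, n_counters=10_000, k=3, s=10, max_val=3):
        self.N, self.k, self.s, self.Max = n_counters, k, s, max_val
        self.c = [0] * n_counters

    def add(self, x):
        for _ in range(self.s):                    # age the filter
            j = random.randrange(self.N)
            if self.c[j] > 0:
                self.c[j] -= 1
        for i in range(self.k):                    # remember the new element
            self.c[hash((i, x)) % self.N] = self.Max

    def __contains__(self, x):
        return all(self.c[hash((i, x)) % self.N] > 0 for i in range(self.k))
</syntaxhighlight>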
Bloomier filters Chazelle et al. designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an
associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a
false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that
is in the map. The simplest case is a map to one-bit values: it uses two Bloom filters A0 and B0, which contain, respectively, all keys mapping to 0 and all keys mapping to 1. Then, to determine which value a given key maps to, we look it up in both filters. If it is in neither, then the key is not in the map. If the key is in
A0 but not
B0, then it does not map to 1, and has a high probability of mapping to 0. Conversely, if the key is in
B0 but not
A0, then it does not map to 0 and has a high probability of mapping to 1. A problem arises, however, when
both filters claim to contain the key. We never insert a key into both, so one or both of the filters is lying (producing a false positive), but we don't know which. To determine this, we have another, smaller pair of filters
A1 and
B1.
A1 contains keys that map to 0 and which are false positives in
B0;
B1 contains keys that map to 1 and which are false positives in
A0. But whenever
A0 and
B0 both produce positives, at most one of these cases must occur, and so we simply have to determine which if any of the two filters
A1 and
B1 contains the key, another instance of our original problem. It may so happen again that both filters produce a positive; we apply the same idea recursively to solve this problem. Because each pair of filters only contains keys that are in the map
and produced false positives on all previous filter pairs, the number of keys is extremely likely to quickly drop to a very small quantity that can be easily stored in an ordinary deterministic map, such as a pair of small arrays with linear search. Moreover, the average total search time is low, because almost all queries will be resolved by the first pair, almost all remaining queries by the second pair, and so on. The total space required is in practice independent of
n, and is almost entirely occupied by the first filter pair. Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into
A0 and then test if the key is in
B0. If so, this is a false positive for
B0, and the key must also be inserted into
A1 recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with
A and
B reversed. Now that we can map a key to the value 0 or 1, how does this help us map to general values? This is simple. We create a single such Bloomier filter for each bit of the result. If the values are large, we can instead map keys to hash values that can be used to retrieve the actual values. The space required for a Bloomier filter with
n-bit values is typically slightly more than the space for 2
n Bloom filters. A very simple way to implement Bloomier filters is by means of minimal
perfect hashing. A minimal perfect hash function h is first generated for the set of n keys. Then an array is filled with n pairs (signature, value) associated with each key at the positions given by the function h when applied to each key. The signature of a key is a string of r bits computed by applying a hash function g of range 2^r on the key. The value of r is chosen such that 2^r \geq 1/\varepsilon, where \varepsilon is the requested false positive rate. To query for a given key, the hash function h is first applied on the key. This gives a position in the array from which we retrieve a pair (signature, value). Then we compute the signature of the key using the function g. If the computed signature is the same as the retrieved signature, we return the retrieved value. The probability of a false positive is 1/2^r (see the sketch below). Another alternative for implementing static Bloomier and Bloom filters, based on matrix solving, has been proposed simultaneously by several authors. The space usage of this method is optimal, as it needs only \log_2(1/\varepsilon) bits per key for a Bloom filter, but the time to generate the Bloom or Bloomier filter can be very high. The generation time can be reduced to a reasonable value at the price of a small increase in space usage. Dynamic Bloomier filters have also been studied; it has been proved that any dynamic Bloomier filter needs at least around \log(l) bits per key, where l is the length of the key. A simple dynamic version of Bloomier filters can be implemented using two dynamic data structures, denoted S1 and S2. S1 stores keys with their associated data, while S2 stores only signatures of keys with their associated data. Those signatures are simply hash values of keys in the range [0, n/\varepsilon], where n is the maximal number of keys to be stored in the Bloomier filter and \varepsilon is the requested false positive rate. To insert a key in the Bloomier filter, its hash value is first computed. Then the algorithm checks if a key with the same hash value already exists in S2. If this is not the case, the hash value is inserted in S2 along with the data associated with the key. If the same hash value already exists in S2, then the key is inserted into S1 along with its associated data. Deletion is symmetric: if the key already exists in S1 it is deleted from there; otherwise the hash value associated with the key is deleted from S2. An issue with this algorithm is how to store S1 and S2 efficiently. For S1 any hash algorithm can be used. To store S2,
Golomb coding could be applied to compress signatures, using space close to \log_2(1/\varepsilon) bits per key.
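The signature-table construction can be sketched as follows. Since building a true minimal perfect hash function is out of scope here, a plain dict stands in for h, which is an explicit simplification.

<syntaxhighlight lang="python">
from hashlib import blake2b

def _sig(key, r):
    """r-bit signature g(key); a colliding signature is the only way a
    missing key can return a value, so the false positive rate is 2**-r."""
    return int.from_bytes(blake2b(key.encode()).digest()[:8], "big") % (1 << r)

class BloomierMPH:
    """Static Bloomier filter sketch. A real implementation would build a
    minimal perfect hash function h over the keys; here a plain dict maps
    each key to its slot, standing in for h."""
    def __init__(self, mapping, r=8):
        self.r = r
        self.slot = {k: i for i, k in enumerate(mapping)}   # stand-in for h
        self.table = [(_sig(k, r), v) for k, v in mapping.items()]

    def get(self, key, default=None):
        # A real MPH returns *some* slot for any key; emulate that here.
        i = self.slot.get(key, hash(key) % len(self.table))
        sig, value = self.table[i]
        return value if sig == _sig(key, self.r) else default
</syntaxhighlight>

For keys in the map, get always returns the stored value; for other keys, it returns the default except with probability about 2^-r.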
Compact approximators Boldi and Vigna proposed a
lattice-based generalization of Bloom filters. A
compact approximator associates to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an element of the lattice, the k array locations associated with the key are updated to the maximum of their current contents and the new lattice element. When reading the value associated with a key, the minimum of the values found in the k locations associated with the key is computed. The resulting value approximates the original value from above.
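For instance, over the lattice of natural numbers ordered by ≤ (max as join, min as meet), a compact approximator might look like this sketch; sizes and hashing are illustrative.

<syntaxhighlight lang="python">
class CompactApproximator:
    """Compact approximator sketch: writes take max into k cells, reads
    take min over them, so the result always upper-bounds the true
    associated value."""
    def __init__(self, m=1024, k=3, bottom=0):
        self.m, self.k = m, k
        self.cells = [bottom] * m

    def _pos(self, key):
        return [hash((i, key)) % self.m for i in range(self.k)]

    def update(self, key, value):
        for p in self._pos(key):
            self.cells[p] = max(self.cells[p], value)

    def read(self, key):
        return min(self.cells[p] for p in self._pos(key))
</syntaxhighlight>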
Parallel-partitioned Bloom filters This variant uses a separate bit array for each hash function, which allows the k hash calculations to proceed in parallel for both insertions and queries.
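A sketch of the layout follows (one array per hash function; in a real implementation the k probes could run on separate cores or memory banks):

<syntaxhighlight lang="python">
class PartitionedBloomFilter:
    """One bit array per hash function; the k probes touch independent
    arrays and so can be computed in parallel."""
    def __init__(self, m_per_partition=1024, k=4):
        self.m, self.k = m_per_partition, k
        self.parts = [[0] * m_per_partition for _ in range(k)]

    def add(self, x):
        for i, part in enumerate(self.parts):
            part[hash((i, x)) % self.m] = 1

    def __contains__(self, x):
        return all(part[hash((i, x)) % self.m]
                   for i, part in enumerate(self.parts))
</syntaxhighlight>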
Scalable Bloom filters Almeida et al. proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability. The technique is based on sequences of standard Bloom filters with increasing capacity and tighter false positive probabilities, so as to ensure that a maximum false positive probability can be set beforehand, regardless of the number of elements to be inserted.
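A sketch of the idea, using the usual sizing formulas for each stage; the growth and tightening ratios are typical choices, not prescribed values.

<syntaxhighlight lang="python">
import math

class ScalableBloomFilter:
    """Sketch: a series of plain Bloom filters. Each new stage has
    `growth` times the capacity and `tightening` times the error budget
    of the previous one, keeping the total error below fp_target."""
    def __init__(self, cap=1024, fp_target=0.01, growth=2, tightening=0.5):
        self.growth, self.tightening = growth, tightening
        self.stages = []
        self._new_stage(cap, fp_target * (1 - tightening))

    def _new_stage(self, cap, fp):
        # Standard sizing: m = -cap*ln(fp)/(ln 2)^2, k = (m/cap)*ln 2.
        m = math.ceil(-cap * math.log(fp) / math.log(2) ** 2)
        k = max(1, round(m / cap * math.log(2)))
        self.stages.append({"bits": [0] * m, "k": k, "cap": cap, "n": 0, "fp": fp})

    def add(self, x):
        s = self.stages[-1]
        if s["n"] >= s["cap"]:                      # stage full: open a new one
            self._new_stage(s["cap"] * self.growth, s["fp"] * self.tightening)
            s = self.stages[-1]
        m = len(s["bits"])
        for i in range(s["k"]):
            s["bits"][hash((i, x)) % m] = 1
        s["n"] += 1

    def __contains__(self, x):
        return any(all(s["bits"][hash((i, x)) % len(s["bits"])]
                       for i in range(s["k"]))
                   for s in self.stages)
</syntaxhighlight>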
Spatial Bloom filters Spatial Bloom filters (SBF) were originally proposed as a data structure designed to store
location information, especially in the context of cryptographic protocols for location
privacy. However, the main characteristic of SBFs is their ability to store
multiple sets in a single data structure, which makes them suitable for a number of different application scenarios. Membership of an element to a specific set can be queried, and the false positive probability depends on the set: the first sets to be entered into the filter during construction have higher false positive probabilities than sets entered at the end. This property allows a prioritization of the sets, where sets containing more "important" elements can be preserved.
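A simplified sketch of the multi-set encoding follows; the original SBF proposal fixes further details, but the essential mechanism is that set labels are written in increasing priority order, so later sets may overwrite earlier labels, and a query returns the minimum label among the probed cells.

<syntaxhighlight lang="python">
class SpatialBloomFilter:
    """Spatial Bloom filter sketch: one array of small integers stores
    several sets at once. Earlier (lower-labelled) sets see more false
    positives because later labels can overwrite theirs."""
    def __init__(self, m=4096, k=4):
        self.m, self.k = m, k
        self.cells = [0] * m

    def add(self, x, label):
        """Labels start at 1; insert the sets in increasing label order."""
        for i in range(self.k):
            p = hash((i, x)) % self.m
            self.cells[p] = max(self.cells[p], label)

    def query(self, x):
        """Return the likely set label of x, or 0 if x is in no set."""
        vals = [self.cells[hash((i, x)) % self.m] for i in range(self.k)]
        return 0 if 0 in vals else min(vals)
</syntaxhighlight>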
Layered Bloom filters A layered Bloom filter consists of multiple Bloom filter layers. Layered Bloom filters allow keeping track of how many times an item was added to the Bloom filter by checking how many layers contain the item. With a layered Bloom filter a check operation will normally return the deepest layer number the item was found in.
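A sketch (layer count, sizes and hashing are illustrative):

<syntaxhighlight lang="python">
class LayeredBloomFilter:
    """Layered Bloom filter sketch: the count of an item is the index of
    the deepest layer that contains it; adding inserts into the first
    layer that does not yet contain the item."""
    def __init__(self, layers=8, m=4096, k=3):
        self.m, self.k = m, k
        self.layers = [[0] * m for _ in range(layers)]

    def _in_layer(self, layer, x):
        return all(layer[hash((i, x)) % self.m] for i in range(self.k))

    def add(self, x):
        for layer in self.layers:
            if not self._in_layer(layer, x):
                for i in range(self.k):
                    layer[hash((i, x)) % self.m] = 1
                return

    def count(self, x):
        """Deepest layer containing x, i.e. approximately how many times
        x was added."""
        n = 0
        for layer in self.layers:
            if not self._in_layer(layer, x):
                break
            n += 1
        return n
</syntaxhighlight>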
Attenuated Bloom filters An attenuated Bloom filter of depth D can be viewed as an array of D normal Bloom filters. In the context of
service discovery in a network, each node stores regular and attenuated Bloom filters locally. The regular or local Bloom filter indicates which services are offered by the node itself. The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of local Bloom filters for nodes i hops away from the node. For example, consider a small network, shown on the graph below. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let node n1 be the starting point. First, we check whether service A is offered by n1 by checking its local filter. Since the patterns do not match, we check the attenuated Bloom filter in order to determine which node should be the next hop. We see that n2 does not offer service A but lies on the path to nodes that do. Hence, we move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence the destination is located. By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further away.
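A sketch of the per-hop structure follows; network propagation is reduced to a merge_neighbor call, and the sizes are illustrative.

<syntaxhighlight lang="python">
class AttenuatedBloomFilter:
    """Attenuated Bloom filter sketch: level 0 holds the node's own
    services; level i is the union of the neighbours' level i-1 filters,
    i.e. services reachable in i hops."""
    def __init__(self, depth=3, m=64, k=2):
        self.m, self.k = m, k
        self.levels = [0] * depth          # one bit mask per hop distance

    def add_local(self, service):
        for i in range(self.k):
            self.levels[0] |= 1 << (hash((i, service)) % self.m)

    def merge_neighbor(self, neighbor):
        """Fold a neighbour's filter in, shifted one hop further away."""
        for d in range(1, len(self.levels)):
            self.levels[d] |= neighbor.levels[d - 1]

    def hops_to(self, service):
        """Smallest hop count whose filter matches, or None."""
        mask = 0
        for i in range(self.k):
            mask |= 1 << (hash((i, service)) % self.m)
        for d, bits in enumerate(self.levels):
            if bits & mask == mask:
                return d
        return None
</syntaxhighlight>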
Chemical structure searching Bloom filters are often used to search large chemical structure databases (see
chemical similarity). In the simplest case, the elements added to the filter (called a
fingerprint in this field) are just the atomic numbers present in the molecule, or a hash based on the
atomic number of each atom and the number and type of its bonds. This case is too simple to be useful. More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings. In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a
PRNG seed, and the first output values used to set bits in the Bloom filter. Molecular fingerprints started in the late 1940s as a way to search for chemical structures stored on punched cards. However, it was not until around 1990 that Daylight Chemical Information Systems, Inc. introduced a hash-based method to generate the bits, rather than use a precomputed table. Unlike the dictionary approach, the hash method can assign bits for substructures which had not previously been seen. In the early 1990s, the term "fingerprint" was considered different from "structural keys", but the term has since grown to encompass most molecular characteristics which can be used for a similarity comparison, including structural keys, sparse count fingerprints, and 3D fingerprints. Unlike Bloom filters, the Daylight hash method allows the number of bits assigned per feature to be a function of the feature size, but most implementations of Daylight-like fingerprints use a fixed number of bits per feature, which makes them a Bloom filter. The original Daylight fingerprints could be used for both similarity and screening purposes. Many other fingerprint types, like the popular ECFP2, can be used for similarity but not for screening because they include local environmental characteristics that introduce false negatives when used as a screen. Even though these are constructed with the same mechanism, they are not Bloom filters because they cannot be used to filter.

==See also==