In this post we will discuss about bloom filter and its use case.Let's create a scenario like this first.
Assume there is cycle stand in our college.And the stand has 1000 slots for parking the cycles.And usually one slot can have 4 cycles.So definitely that stand has capacity to have 4000 cycles.And it is very well known that Mr. Akash keeps his cycle in slot no 1 every day.
So if we want to know if Akash is present in college today,we just check slot no 1 and if there is any cycle available there , we say yes Akash is present in college.But it is not hundred percent correct.As we said above each slot can have four cycles,it may be possible that cycle present in slot no 1 may not belongs to Akash.
So here a case arrises, which is false positive.But if no cycle is there in slot no 1, we say that definitely Akash is absent today.So there is no chance of false negative.That is we never say that Akash is absent today in case of his presence in college.
Bloom filter is a simple hash based filter works on the same principle.It allows to store elements and help us to quickly identify many (not all) elements those are not present.Sometimes we can say that Akash is not in the college(if no cycle is there in slot 1).
We will do two things with bloom filter
1.insert an element in the filter.
2.Test an element if it is a member of bloom filter.
Bloom filter is a probabilistic data structure,that is not deterministic.We will came to know the reason in a while.
Let's take a bit array of size m.Initialize each position to zero.The idea here is that we choose k hash functions whose max value will be within the range 0 to m.Here k is constant such that k
To test for an element (whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set. if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set by chance to 1 during the insertion of other elements, which results false positive.
Deletion of elements are not allowed due to obvious reasons.Suppose we want to delete an element then defintely we need to set any of the k bit positions generated by k hash functions for that element to 0.But there will be a chance that bit which we are going to set to 0 may be the result of some other elements.
To add an element in the filter simply pass that element to the k hash functions.Now we will get k different values between 0 to m-1 i.e we ll get k different array positions.Mark these positions to 1.Note that here we are not putting the exact hash value in array we are simply marking that position to 1.
As false negatives are not allowed in Bloom filter, we can't delete the elements from it.
1.Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks.
2.The Google Chrome web browser use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).
Although the Bloom Filter is a data structure ,it is called a filter because because it often used as a first pass to filter out elements of a data set that dont match a certain criteria.
Please refer wikipedia for more applications and detailed explanations.
Assume there is cycle stand in our college.And the stand has 1000 slots for parking the cycles.And usually one slot can have 4 cycles.So definitely that stand has capacity to have 4000 cycles.And it is very well known that Mr. Akash keeps his cycle in slot no 1 every day.
So if we want to know if Akash is present in college today,we just check slot no 1 and if there is any cycle available there , we say yes Akash is present in college.But it is not hundred percent correct.As we said above each slot can have four cycles,it may be possible that cycle present in slot no 1 may not belongs to Akash.
So here a case arrises, which is false positive.But if no cycle is there in slot no 1, we say that definitely Akash is absent today.So there is no chance of false negative.That is we never say that Akash is absent today in case of his presence in college.
Bloom filter is a simple hash based filter works on the same principle.It allows to store elements and help us to quickly identify many (not all) elements those are not present.Sometimes we can say that Akash is not in the college(if no cycle is there in slot 1).
Use Case:
Suppose we are going to create an anti virus software, which will maintain the list of malicious site and a list of know viruses.A naive approach is to maintain data structure to hold the details of all the malicious programmes.A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious programmes, and programmes need an average of 10 bytes to store, then you need 10 megabytes of storage. That’s definitely an overhead . Is there any efficient way?Yes there is.Implementation Details:
We will do two things with bloom filter
1.insert an element in the filter.
2.Test an element if it is a member of bloom filter.
Bloom filter is a probabilistic data structure,that is not deterministic.We will came to know the reason in a while.
Let's take a bit array of size m.Initialize each position to zero.The idea here is that we choose k hash functions whose max value will be within the range 0 to m.Here k is constant such that k
To test for an element (whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set. if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set by chance to 1 during the insertion of other elements, which results false positive.
Deletion of elements are not allowed due to obvious reasons.Suppose we want to delete an element then defintely we need to set any of the k bit positions generated by k hash functions for that element to 0.But there will be a chance that bit which we are going to set to 0 may be the result of some other elements.
To add an element in the filter simply pass that element to the k hash functions.Now we will get k different values between 0 to m-1 i.e we ll get k different array positions.Mark these positions to 1.Note that here we are not putting the exact hash value in array we are simply marking that position to 1.
As false negatives are not allowed in Bloom filter, we can't delete the elements from it.
Applications of Bloom Filter:
1.Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks.
2.The Google Chrome web browser use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).
Although the Bloom Filter is a data structure ,it is called a filter because because it often used as a first pass to filter out elements of a data set that dont match a certain criteria.
Please refer wikipedia for more applications and detailed explanations.









 
No comments:
Post a Comment