Sunday, October 23, 2016

Bloom Filters By Example

In this post we will discuss the Bloom filter and its use cases. Let's create a scenario first.
Assume there is a cycle stand in our college, and the stand has 1000 slots for parking cycles. Each slot can hold 4 cycles, so the stand has a capacity of 4000 cycles. It is also well known that Mr. Akash keeps his cycle in slot no. 1 every day.

So if we want to know whether Akash is present in college today, we just check slot no. 1, and if any cycle is there, we say yes, Akash is present. But this is not a hundred percent correct. As we said above, each slot can hold four cycles, so the cycle in slot no. 1 may not belong to Akash.

So here a case arises, which is a false positive. But if no cycle is in slot no. 1, we can say that Akash is definitely absent today, so there is no chance of a false negative. That is, we never say Akash is absent when he is actually present in college.

A Bloom filter is a simple hash-based filter that works on the same principle. It lets us store elements and quickly identify many (though not all) elements that are not present, just as we can sometimes say for certain that Akash is not in college (when no cycle is in slot 1).


Use Case:

Suppose we are building antivirus software that maintains a list of malicious sites and a list of known viruses. A naive approach is to maintain a data structure holding the details of all the malicious programmes. The problem with this approach is that it may consume a considerable amount of memory: if you know of a million malicious programmes, and each needs an average of 10 bytes to store, you need 10 megabytes of storage. That is a definite overhead. Is there a more efficient way? Yes, there is.
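For comparison, here is the standard back-of-envelope analysis (not from the original post): after inserting n elements into an m-bit Bloom filter with k hash functions, the probability that a lookup for an absent element returns a false positive is approximately

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For a 1% false-positive rate this works out to roughly 9.6 bits per element, so a million entries fit in about 1.2 megabytes instead of 10, at the cost of occasional false positives.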

Implementation Details:


We will do two things with a Bloom filter:
1. Insert an element into the filter.
2. Test whether an element is a member of the filter.

A Bloom filter is a probabilistic data structure, not a deterministic one. We will see the reason in a while.

Let's take a bit array of size m and initialize every position to zero. The idea is that we choose k hash functions, each of which maps an element to a position in the range 0 to m-1. Here k is a small constant, much smaller than m.

To add an element to the filter, simply pass it to the k hash functions. This gives us k array positions between 0 and m-1; mark the bits at these positions as 1. Note that we are not putting the exact hash values in the array, we are simply marking those positions as 1.

To test for an element (whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set: if it were, all those bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits were set by chance during the insertion of other elements, which results in a false positive.

Deletion of elements is not allowed, for an obvious reason. To delete an element we would need to set the k bit positions generated by its hash functions back to 0, but any of those bits may also have been set by some other element, and clearing it would introduce a false negative for that element. Since false negatives are not allowed in a Bloom filter, we can't delete elements from it.
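The add and test operations described above can be sketched in a few lines of Java. This is a minimal illustration, not a production implementation: the class name and the hash scheme (deriving k positions from two base hashes) are my own choices for the example.

```java
import java.util.BitSet;

// A minimal Bloom filter sketch: an m-bit array and k hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;  // number of bits in the filter
    private final int k;  // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derive the i-th array position from two cheap base hashes.
    // (An illustrative scheme; any k independent hash functions would do.)
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // second hash: rotate the first
        return Math.floorMod(h1 + i * h2, m);
    }

    // Insert: set all k positions to 1.
    public void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    // Test: if any of the k positions is 0, the element is definitely absent.
    public boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false; // definite no: no false negatives
            }
        }
        return true; // maybe yes: false positives are possible
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("http://malicious-a");
        filter.add("http://malicious-b");
        System.out.println(filter.mightContain("http://malicious-a")); // true
        System.out.println(filter.mightContain("http://harmless"));    // almost certainly false
    }
}
```

Note that `mightContain` can only answer "definitely not present" or "possibly present", which is exactly the cycle-stand behaviour from the introduction.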


Applications of Bloom Filter:


1. Cassandra uses Bloom filters to save I/O when performing a key lookup: each SSTable has a Bloom filter associated with it, which Cassandra checks before doing any disk seeks.
2. The Google Chrome web browser used a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).


Although the Bloom filter is a data structure, it is called a filter because it is often used as a first pass to filter out elements of a data set that don't match a certain criterion.

Please refer to Wikipedia for more applications and detailed explanations.

Wednesday, October 19, 2016

Least Recently Used (LRU) cache implementation in Java

I was working on a requirement to cache URLs coming continuously from a source. Our cache has a limited size, and we can't afford to store all the URLs. So we decided to store the 5 lakh (500,000) most recently used URLs in our cache. URLs that have not been used for a while (the least recently used URLs) will be evicted, and new URLs coming from the source will be added to the cache. If a new URL is already present in the cache, we will mark it as most recently used.
   
For more clarity on caches, please refer to the blog post on caches. Now let's decide which data structure to use. We can simply think of storing the URLs in a LinkedList of size 5 lakh, but there are some limitations to using a LinkedList.

Limitation 1:

What happens when we want to find the least recently used URL? This is a frequent requirement, as we want to evict the least recently used URL from the cache to make room for more recently used URLs.

With a LinkedList, the complexity of this is of order n, i.e. O(n), which is not desired.

Limitation 2:

And what happens when a URL comes from the source and we want to check whether it already exists in the cache? For this we have to traverse the whole list.

Again the complexity is of order n, i.e. O(n), which is not desired.
   
From this we conclude that a LinkedList alone is not the correct choice.



To solve the first problem we use a doubly linked list, where the least recently used element sits at the tail of the list and can be accessed in O(1) time, while the most recently used elements sit at the head.

To solve the second problem we use a hash map, so that we can check whether a URL is in the cache in O(1) time.

So to create an LRU cache, we take the help of two data structures: a doubly linked list and a HashMap.

Please see the implementation below; it is straightforward.




package com.brainatjava.lru;

import java.util.HashMap;
import java.util.Map;

public class LRUCache {

    private DoublyLinkedList urlList;
    private Map<String, Node> urlMap;

    public LRUCache(int cacheSize) {
        urlList = new DoublyLinkedList(cacheSize);
        urlMap = new HashMap<String, Node>();
    }

    public void accessURL(String url) {
        Node pageNode = null;
        if(urlMap.containsKey(url)) {
            // If the URL is present in the cache, move its node to the head of the list
            pageNode = urlMap.get(url);
            urlList.takeURLToHead(pageNode);
        } else {
            // If the URL is not present in the cache, add it
            if(urlList.getCurrSize() == urlList.getSize()) {
                // If the cache is full, remove the tail's URL from the map;
                // addPageToList below will evict the tail node from the list
                urlMap.remove(urlList.getTail().getURL());
            }
            pageNode = urlList.addPageToList(url);
            urlMap.put(url, pageNode);
        }
    }
     
    public void printCacheState() {
      urlList.printList();
        System.out.println();
    }

    public static void main(String[] args) {
        int cacheSize = 4;
        LRUCache cache = new LRUCache(cacheSize);
        cache.accessURL("http://a");
        cache.printCacheState();
        cache.accessURL("http://b");
        cache.printCacheState();
        cache.accessURL("http://a");
        cache.printCacheState();
        cache.accessURL("http://a");
        cache.printCacheState();
        cache.accessURL("http://d");
        cache.printCacheState();
        cache.accessURL("http://c");
        cache.printCacheState();
        cache.accessURL("http://g");
        cache.printCacheState();
        cache.accessURL("http://h");
        cache.printCacheState();
        cache.accessURL("http://c");
        cache.printCacheState();
    }
}

class DoublyLinkedList {
     
    private final int size;
    private int currSize;
    private Node head;
    private Node tail;

    public DoublyLinkedList(int size) {
        this.size = size;
        currSize = 0;
    }

    public Node getTail() {
        return tail;
    }

    public void printList() {
        if(head == null) {
            return;
        }
        Node tmp = head;
        while(tmp != null) {
            System.out.print(tmp);
            tmp = tmp.getNext();
        }
    }

    public Node addPageToList(String url) {
        Node pageNode = new Node(url);       
        if(head == null) {
            head = pageNode;
            tail = pageNode; 
            currSize = 1;
            return pageNode;
        } else if(currSize < size) {
            currSize++;
        } else {
            tail = tail.getPrev();
            tail.setNext(null);
        }
        pageNode.setNext(head);
        head.setPrev(pageNode);
        head = pageNode;
        return pageNode;
    }

    public void takeURLToHead(Node node) {
        if(node == null || node == head) {
            return;
        }

        if(node == tail) {
            tail = tail.getPrev();
            tail.setNext(null);
        }
         
        Node prev = node.getPrev();
        Node next = node.getNext();
        prev.setNext(next);

        if(next != null) {
            next.setPrev(prev);
        }

        node.setPrev(null);
        node.setNext(head);
        head.setPrev(node);
        head = node;    
    }

    public int getCurrSize() {
        return currSize;
    }

    public void setCurrSize(int currSize) {
        this.currSize = currSize;
    }

    public Node getHead() {
        return head;
    }

    public void setHead(Node head) {
        this.head = head;
    }

    public int getSize() {
        return size;
    }   
}

class Node {
     
    private String url;
    private Node prev;
    private Node next;
     
    public Node(String url) {
        this.url = url;
    }

    public String getURL() {
        return url;
    }

    public void setURL(String url) {
        this.url = url;
    }
     
    public Node getPrev() {
        return prev;
    }

    public void setPrev(Node prev) {
        this.prev = prev;
    }

    public Node getNext() {
        return next;
    }

    public void setNext(Node next) {
        this.next = next;
    }
     
    public String toString() {
        return url + "  ";
    }
}
 

The method we applied here uses a doubly linked list and a HashMap: the doubly linked list maintains the recency order and lets us find the tail of the cache in O(1) time, and the HashMap lets us check whether a URL already exists in the cache in O(1) time.

But Java has a lesser known data structure called LinkedHashMap, which provides the features of both a doubly linked list and a hash map.

But remember that by default the LinkedHashMap order is the insertion order, not the access order. There is, however, a constructor that provides access order: we should use LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder) for this purpose.
 
// Straight from the Java doc:
 A special constructor is provided to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). This kind of map is well-suited to building LRU caches. Invoking the put or get method results in an access to the corresponding entry (assuming it exists after the invocation completes). The putAll method generates one entry access for each mapping in the specified map, in the order that key-value mappings are provided by the specified map's entry set iterator. No other methods generate entry accesses. In particular, operations on collection-views do not affect the order of iteration of the backing map.
Link: http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html

Let's see the implementation


package com.brainatjava.lru;

import java.util.LinkedHashMap;

public class LRUCache<K, V> extends LinkedHashMap<K, V> {

    private final int size;

    private LRUCache(int size) {
        // initialCapacity = size, loadFactor = 0.75, accessOrder = true
        super(size, 0.75f, true);
        this.size = size;
    }

    @Override
    protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once we exceed the cache size
        return size() > size;
    }

    public static <K, V> LRUCache<K, V> newInstance(int size) {
        return new LRUCache<K, V>(size);
    }
}
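To see the access-order behaviour described above in action, here is a self-contained sketch (the class name, capacity, and sample URLs are my own, for illustration). It builds an access-ordered LinkedHashMap directly, overriding removeEldestEntry to evict the least recently used entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessOrderDemo {

    // Build an access-ordered LinkedHashMap that evicts its eldest
    // (least recently used) entry once it grows past 'capacity'.
    static Map<String, String> buildCache(final int capacity) {
        return new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> cache = buildCache(3);
        cache.put("http://a", "A");
        cache.put("http://b", "B");
        cache.put("http://c", "C");
        cache.get("http://a");      // marks http://a as most recently used
        cache.put("http://d", "D"); // evicts http://b, the least recently used
        System.out.println(cache.keySet()); // [http://c, http://a, http://d]
    }
}
```

Note that because the map was created with accessOrder = true, the get call reorders the entries, so http://b (not http://a) is the eldest entry when http://d is inserted.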

Thursday, October 13, 2016

Caches

First of all, let's understand: what is a cache? In plain computer science terms, a cache is a small buffer of pages the OS maintains in order to avoid more expensive main memory accesses.

A CPU cache is usually present on the CPU chip itself, while main memory (RAM) sits on the motherboard and is connected to the CPU.
Because the cache is closer to the CPU, it is much faster than RAM: each read from main memory has to travel to the CPU, while the CPU cache is right there.

Cache is more expensive than primary memory.

Why have another temporary memory when we already have cheap and large main memory?


It is mainly to improve speed. The cache is there to reduce the average memory access time for the CPU.

When the CPU needs some data from the memory, the cache is checked first and if data is available in the cache it gets it from there. There is no need to perform a memory read.

Caches are faster than main memory; however, they are smaller in size. Therefore, pages are likely to be swapped in and out of the cache. If a page is not found in the cache and main memory is accessed to fetch it, that is a cache miss. The page is then brought into the cache, and the next time it is accessed, it is served from the cache.

What if there is no space left in the cache when a cache miss occurs? The new page has to be swapped with one of the pages already in the cache. How do we decide which page goes out of the cache, so that the increase in cache misses is minimal? There are many approaches (eviction methods) to decide which page leaves the cache to make space for the new one, such as First In First Out, Least Recently Used, and Least Frequently Used.
What is a least recently used cache?

In the ‘First In First Out’ approach, the OS selects the page which is oldest in the cache and swaps it with the new page. In the ‘Least Recently Used’ approach, the OS selects the page which was not accessed for the longest period of time. In the ‘Least Frequently Used’ approach, the OS selects the page which has been accessed the least number of times up to a given point of time.

In this post, we will concentrate on the Least Recently Used approach and implement it.

An LRU cache is similar to first in first out (FIFO) storage, where pages which came first are evicted first. The difference between FIFO storage and an LRU cache comes from the fact that when a page is accessed again, that page is moved to the top.

If a page entered the cache first, it is the first candidate to go out if it is not accessed again before the cache fills up and a cache miss happens.

Here we will describe the implementation of an LRU cache in Java.