Design Hashmap


A HashMap is a data structure that stores key-value pairs in buckets. Its specialty is that it can retrieve the value for any key in constant time, O(1), on average.

The following are important terms in the context of a HashMap.

  1. Bucket: Each bucket is simply a slot in the array. We can think of a HashMap as an array whose elements are of the Entry data type, so a HashMap can be described as an array of buckets.

  2. Default Size: A HashMap has a default capacity of 16; that is, the bucket array starts with 16 slots. Starting with a reasonably large array helps reduce collisions.

  3. Load Factor: In a HashMap, the load factor is a parameter that determines when the HashMap should be resized to accommodate more elements. It is a value between 0 and 1, representing the fraction of the HashMap's capacity that can be filled with key-value pairs before the HashMap is automatically resized.

  4. Collision: If two or more keys have the same hash code, there is a collision: multiple keys are trying to occupy the same bucket.

Working of a HashMap:

A HashMap is an array of buckets; in our implementation, each bucket holds nodes of a type we call Entry.

Entry<K, V>[] hashTable;

Entry is a generic class. Each Entry is a node in a singly linked list:

    class Entry<K,V> {
        K key;
        V value;
        Entry<K, V> next;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }

        public K getKey() {
            return key;
        }

        public void setKey(K key) {
            this.key = key;
        }

        public V getValue() {
            return value;
        }

        public void setValue(V value) {
            this.value = value;
        }
    }

Constructors: Within the primary logic of the hashmap, there are two constructor options available. The first is the default constructor, which initializes the hashmap with a size of 16. The second constructor allows you to specify a custom initial capacity.

Reasoning for max size: HashMap capacities are typically chosen to be powers of two. A Java int is 32 bits wide, with the first bit reserved as the sign bit, so its maximum value is (2 to the power of 31) - 1 and its minimum is -(2 to the power of 31). Since 2 to the power of 31 overflows an int, the largest power of two a positive int can represent is 2 to the power of 30, which is why the maximum capacity is set to 1 << 30.

    private static final int INITIAL_SIZE = 1<<4;
    private static final int MAXIMUM_CAPACITY = 1<<30;    

    public MyHashMap() {
        this.hashTable = new Entry[INITIAL_SIZE];
    }

    public MyHashMap(int capacity){
        int tableSize = tableSizeFor(capacity);
        hashTable = new Entry[tableSize];
    }

    final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >> 1;
        n |= n >> 2;
        n |= n >> 4;
        n |= n >> 8;
        n |= n >> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

Table size for given capacity: The tableSizeFor function calculates the smallest power of two that is greater than or equal to the input capacity (cap). The method above achieves this in just five shift-and-OR steps, which is far more efficient than looping. Below, a brute-force alternative is provided solely for the sake of comprehension.

    public static int nextPowerOfTwo(int cap) {
        int result = 1;
        while (result < cap) {
            result *= 2;
        }
        return result;
    }

Why we subtract 1 first: The shift-and-OR cascade copies the highest set bit of n into every lower bit position, turning n into an unbroken run of 1s; adding 1 then yields a power of two. Subtracting 1 up front handles inputs that are already powers of two. For instance, consider cap = 4: the desired table size is 4 itself, but if we skipped the subtraction the cascade would turn 4 into 7 and the result would be 8. Subtracting 1 first prevents this discrepancy.
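To see the subtraction at work, here is a small runnable sketch of the same tableSizeFor logic (the class name TableSizeDemo is illustrative, not from the original source):

```java
// Demonstrates the subtract-one trick: the OR/shift cascade copies the
// highest set bit into every lower position, so n becomes a run of 1s
// and n + 1 is a power of two.
public class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1; // without this, an exact power of two would round up
        n |= n >> 1;
        n |= n >> 2;
        n |= n >> 4;
        n |= n >> 8;
        n |= n >> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(4));  // 4  (already a power of two)
        System.out.println(tableSizeFor(5));  // 8
        System.out.println(tableSizeFor(17)); // 32
    }
}
```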

Put and get the key:

In the put method, we employ a separate-chaining strategy: when two keys land in the same bucket, the new key is appended to the end of the existing linked list. Java generates a hashCode() for every object; these hash codes are integer values and can be negative, so we bring them within the bounds of our table by masking off the sign bit and then taking the remainder of division by the table's length.

    public void put(K key, V value) {
        // Mask off the sign bit: hashCode() can be negative in Java, and a
        // negative index would throw ArrayIndexOutOfBoundsException.
        int index = (key.hashCode() & 0x7fffffff) % hashTable.length;
        Entry<K, V> node = hashTable[index];

        if (node == null) {
            hashTable[index] = new Entry<>(key, value);
        } else {
            Entry<K, V> previousNode = node;
            while (node != null) {
                // compare keys with equals(), not ==
                if (node.key.equals(key)) {
                    node.value = value;
                    return;
                }
                previousNode = node;
                node = node.next;
            }
            previousNode.next = new Entry<>(key, value);
        }
    }

    public V get(K key) {
        int index = (key.hashCode() & 0x7fffffff) % hashTable.length;
        Entry<K, V> node = hashTable[index];

        while (node != null) {
            if (node.key.equals(key)) {
                return node.value;
            }
            node = node.next;
        }
        return null;
    }
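Putting the pieces together, here is a minimal, runnable assembly of the snippets above. It is a sketch, not a complete implementation: resizing is omitted and the fixed capacity of 16 is assumed for brevity.

```java
// Minimal separate-chaining HashMap assembled from the article's snippets.
public class MyHashMap<K, V> {
    static class Entry<K, V> {
        K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final Entry<K, V>[] hashTable;

    @SuppressWarnings("unchecked")
    public MyHashMap() {
        hashTable = new Entry[16]; // default capacity; no resizing in this sketch
    }

    // Mask off the sign bit so a negative hashCode() cannot yield a negative index.
    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % hashTable.length;
    }

    public void put(K key, V value) {
        int index = indexFor(key);
        Entry<K, V> node = hashTable[index];
        if (node == null) {
            hashTable[index] = new Entry<>(key, value);
            return;
        }
        Entry<K, V> previous = node;
        while (node != null) {
            if (node.key.equals(key)) { // overwrite existing key
                node.value = value;
                return;
            }
            previous = node;
            node = node.next;
        }
        previous.next = new Entry<>(key, value); // append to the chain
    }

    public V get(K key) {
        for (Entry<K, V> node = hashTable[indexFor(key)]; node != null; node = node.next) {
            if (node.key.equals(key)) return node.value;
        }
        return null;
    }

    public static void main(String[] args) {
        MyHashMap<String, Integer> map = new MyHashMap<>();
        map.put("one", 1);
        map.put("two", 2);
        map.put("one", 100);                  // overwrite
        System.out.println(map.get("one"));   // 100
        System.out.println(map.get("two"));   // 2
        System.out.println(map.get("three")); // null
    }
}
```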

source code (link)

More about Load Factor:

  1. Initially, a HashMap is created with a certain capacity, which is the number of buckets it has to store key-value pairs.

  2. As you add key-value pairs to the HashMap, it keeps track of the number of entries. When the number of entries exceeds a certain threshold based on the load factor, the HashMap is automatically resized (typically, it's doubled in size).

  3. Resizing the HashMap involves creating a new, larger array of buckets and rehashing all the existing key-value pairs into the new array. This process ensures that the distribution of elements remains relatively uniform, which is important for efficient HashMap operations.

  4. A common load factor value is 0.75, which means that the HashMap will be resized when it's filled to 75% of its capacity. This is often considered a good trade-off between memory usage and performance. However, you can adjust the load factor when creating a HashMap to suit your specific requirements.
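The resize steps above can be sketched as follows. This is an illustrative sketch with hypothetical names (ResizeDemo, add, newTable), not the article's source code; it tracks integer keys only, to keep the rehashing loop easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of steps 2-3: when the entry count would exceed capacity * loadFactor,
// allocate a doubled bucket array and rehash every key into it.
public class ResizeDemo {
    static final double LOAD_FACTOR = 0.75;

    @SuppressWarnings("unchecked")
    static List<Integer>[] newTable(int capacity) {
        List<Integer>[] t = new List[capacity];
        for (int i = 0; i < capacity; i++) t[i] = new ArrayList<>();
        return t;
    }

    List<Integer>[] table = newTable(16);
    int size = 0;

    void add(int key) {
        if (size + 1 > table.length * LOAD_FACTOR) {
            // resize: double the array and rehash all existing keys
            List<Integer>[] bigger = newTable(table.length * 2);
            for (List<Integer> bucket : table)
                for (int k : bucket)
                    bigger[Math.floorMod(k, bigger.length)].add(k);
            table = bigger;
        }
        table[Math.floorMod(key, table.length)].add(key);
        size++;
    }

    public static void main(String[] args) {
        ResizeDemo d = new ResizeDemo();
        for (int i = 0; i < 13; i++) d.add(i); // 13th insert crosses 16 * 0.75 = 12
        System.out.println(d.table.length);    // 32
    }
}
```

Note that every key gets a new bucket index after the resize, because the index depends on the table length; that is why a full rehash is unavoidable.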

More about Collision Strategies:

When a collision occurs, the data structure needs a mechanism to resolve it, ensuring that each key can still be retrieved accurately. There are several common ways to handle collisions:

  1. Separate Chaining: In this approach, each bucket in the hash table is implemented as a linked list. When a collision happens, the key-value pair is added to the linked list in that bucket. When you need to retrieve a value associated with a key, you first compute the hash code, go to the appropriate bucket, and then search the linked list within that bucket to find the desired key-value pair.

  2. Open Addressing: In open addressing, when a collision occurs, the data structure looks for the next available (unoccupied) bucket in a predefined sequence, often using methods like linear probing, quadratic probing, or double hashing. This way, the key is placed in the next available bucket, and retrieval involves searching through this sequence until the key is found or an empty bucket is encountered.

  3. Robin Hood Hashing: This is a variation of open addressing where items are moved within the buckets to maintain a relatively balanced distribution of items, reducing the variance in search times.
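The open-addressing approach from point 2 can be sketched with linear probing. This is a hypothetical minimal table (no deletion or resizing, so it must stay under-full), not part of the article's HashMap:

```java
// Linear probing: on a collision, scan forward (wrapping around) until a
// free slot, or the key itself, is found. The table must never fill up
// completely in this sketch, or put() would loop forever.
public class LinearProbing {
    Integer[] keys = new Integer[8]; // null marks an empty slot
    int[] values = new int[8];

    void put(int key, int value) {
        int i = Math.floorMod(key, keys.length);
        while (keys[i] != null && keys[i] != key) {
            i = (i + 1) % keys.length; // probe the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    Integer get(int key) {
        int i = Math.floorMod(key, keys.length);
        while (keys[i] != null) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return null; // hit an empty slot: the key is absent
    }

    public static void main(String[] args) {
        LinearProbing t = new LinearProbing();
        t.put(1, 10);
        t.put(9, 90);                 // 9 % 8 == 1 collides with key 1, lands in slot 2
        System.out.println(t.get(1)); // 10
        System.out.println(t.get(9)); // 90
    }
}
```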