Hashing

Hash function

  • Given a key k, use a hash function to compute a number (called a hash value) that represents k (we will cover hash functions and their design in more detail in the next section of the text). This process of using a hash function to compute a hash value for some key k is called Hashing
  • Then, we can use the hash value of k to compute an index in some array in which we will store k, which is the idea behind a Hash Table
  • 2 properties that describe hash function:

    • Property of Equality: Given two keys k and l, if k and l are equal, h(k) must equal h(l). In other words, if two keys are equal, they must have the same hash value
    • Property of Inequality: Given two keys k and l, if k and l are not equal, it would be nice if h(k) was not equal to h(l). In other words, if two keys are not equal, it would be nice (but not necessary!) for them to have different hash values
  • Therefore, a good hash function for strings of length k (or generally, lists/containers/etc. with k elements) must have a time complexity of O(k) and must perform arithmetic that is non-commutative.

  • valid hash function should satisfy property of equality.

h(k) = 10 // even this is valid. because for each input k it will return the same value.

Hash Table

  • Whenever we say "O(1)" or "constant time" with regard to the average-case time complexity of a Hash Table, this is ignoring the time complexity of the hash function.
  • For primitive data types, hash functions are constant time, but for types that are collections of other types (e.g. lists, strings, etc.), good hash functions iterate over all the elements in the collection. For example, a good hash function for strings of length k would iterate over all k characters in the string, making it O(k).
  • A Hash Table is implemented by an array, and more formally, we define a Hash Table to be an array of size M (which we call the Hash Table's capacity) that stores keys k by using a hash function to compute an index in the array at which to store k.
  • In other words, if we will be inserting N elements into our Hash Table, in order to keep the make the expected number of collisions equal to 1, we need our Hash Table to have a capacity M=O(N^2)
  • always choosing the capacity of our Hash Table to be a prime number. By definition, a prime number is a natural number that that has no positive divisors other than 1 and itself. Consequently, modding by a prime number will guarantee that there are no factors, other than 1 and the prime number itself, that will cause the mod function to never return some index values.
  • If we end up inserting more keys than we expected, or if we didn't have a good estimate of N, we should make sure that our load factor α=N/M never exceeds 0.75 or so.

Collision Resolution

  • Linear Probing: The idea behind this strategy is extremely simple: if an object key maps to an index that is already occupied, simply shift over and try the next available index.
  • Separate Chaining: The basic idea behind Separate Chaining is to keep pointers to Linked Lists as the keys in our Hash Table. Moreover, since we are now allowing multiple keys to be at a single index, we like to say that Separate Chaining is an open hashing collision resolution strategy.

    • ways of eliminating duplicates: check in insert(search the whole linked list whenever we do insertion), check in delete(delete all instance of the item we are going to delete. this is better)
  • Cuckoo Hashing: if an inserting key collides with a key already in the Hash Table, the inserting key pushes out the key in its way and takes its place. The displaced key then hashes to a new location.

Hash Maps

  • The functionality we have described, in which we query a key and receive a value, is defined in the Map ADT, which we will formally describe very soon. After discussing the Map ADT, we will introduce a data structure that utilizes the techniques we used in Hash Tables to implement the Map ADT: the Hash Map
  • Question: what is the difference between hash map and hash table?
Ans:
Hash Table is an array which each element is connected to a linked list. 
The key would be an integer, which will mod M to get index.
However, in a HashMap, we can store <key, value> pair, while key can be any type.

A Hash Map is effectively just an extension of a Hash Table, 
where we do all of the Hash Table operations on the key, 
but when we actually perform the insertion, we perform the (key, value) pair.
  • The Map ADT can theoretically be implemented in a multitude of ways. For example, we could implement it as a Binary Search Tree: we would store two items inside each node, the key and the value, and would keep the Binary Search Tree ordering property based on just keys.
  • if we didn't care so much about the sorting property but rather wanted faster put and has operations (if we desired a constant time-complexity, for example), then the Map ADT could also be implemented effectively as a Hash Table: we refer to this implementation as a Hash Map.
  • REFERECE: Stepik 5

results matching ""

    No results matching ""