Computer Science 385
Analysis of Algorithms
Siena College
Spring 2011


Topic Notes: Maps and Hashing


Maps/Dictionaries
A map or dictionary is a structure used for looking up items via key-value associations.
There are many possible implementations of maps. If we think abstractly about a few variants,
we can consider their time complexity on common operations and space complexity. In the table
below, n denotes the actual number of elements in the map, and N denotes the maximum number
of elements it could contain (where that restriction makes sense).
 Structure                    Search    Insert    Delete    Space
 Linked List                  O(n)      O(1)      O(n)      O(n)
 Sorted Array                 O(log n)  O(n)      O(n)      O(N)
 Balanced BST                 O(log n)  O(log n)  O(log n)  O(n)
 Array[KeyRange] of EltType   O(1)      O(1)      O(1)      KeyRange
That last line is very interesting. If we know the range of keys (suppose they're all integers between
0 and KeyRange-1), we can use the key directly as an index into an array, and have constant-time
access!


Hashing
The map implementation of an array with keys as the subscripts and values as contents makes a
lot of sense. For example, we could store information about members of a team whose uniform
numbers are all within a given range by storing each player's information in the array location
indexed by their uniform number. If some uniform numbers are unused, we'll have some empty
slots, but it gives us constant time access to any player's information as long as we know the
uniform number.
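As a minimal sketch of the idea, assuming a hypothetical 0-99 uniform-number range (the class and names below are made up for illustration):

```java
// Direct addressing: the uniform number itself is the array index,
// so insertion and lookup are single array operations, O(1).
class Roster {
    private final String[] players = new String[100]; // assumed key range 0..99

    void add(int uniformNumber, String name) {
        players[uniformNumber] = name;
    }

    String lookup(int uniformNumber) {
        return players[uniformNumber]; // null for unused numbers
    }
}
```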
However, there are some important restrictions on the use of this representation.
This implementation assumes that the data has a key which is of a type that can be used as an array
subscript (some enumerated type in Pascal, integers in C/C++/Java), which is not always the case.
What if we're trying to store strings or doubles or something more complex?
Moreover, the size requirements for this implementation could be prohibitive. Suppose we wanted
to store 2000 student records indexed by social security number or a Siena ID number. We would
need an array with 1 billion elements!
In cases like this, where most of the entries will be empty, we would like to use a smaller array, but
come up with a good way to map our elements into the array entries.
Suppose we have a lot of data elements of type E and a set of indices in which we can store data
elements. We would like to obtain a function H: E → index, which has the properties:

   1. H(elt) can be computed quickly

   2. If elt1 ≠ elt2 then H(elt1) ≠ H(elt2). (i.e., H is a one-to-one function)

This is called a perfect hashing function. Unfortunately, they are difficult to find unless you know
all possible entries to the table in advance. This is rarely the case.
Instead we use something that behaves well, but not necessarily perfectly.
The goal is to scatter elements through the array randomly so that they won't attempt to be stored
in the same location, or "bump into" each other.
So we will define a hash function H: Keys → Addresses (or array indices), and we will call
H(element.key) the home address of element.
Assuming no two distinct elements wind up mapping to the same location, we can now add, search
for, and remove them in O(1) time.
The one thing we can no longer do efficiently that we could have done with some of the other maps
is to list them in order.
Since the function is not guaranteed to be one-to-one, we can't guarantee that no two distinct
elements map to the same home address.
This suggests two problems to consider:

   1. How to choose a good hashing function?

   2. How to resolve the situation where two different elements are mapped to the same home address?


Selecting Hashing Functions
The following quote should be memorized by anyone trying to design a hashing function:
"A given hash function must always be tried on real data in order to find out whether it is effective
or not."
Data which contains unexpected and unfortunate regularities can completely destroy the usefulness
of any hashing function!
We will start by considering hashing functions for integer-valued keys.

Digit selection As the name suggests, this involves choosing digits from certain positions of the key
      (e.g., the last 3 digits of an SSN).
      Unfortunately, it is easy to make a very unfortunate choice. We would need careful analysis
      of the expected keys to see which digits will work best. Likely patterns can cause problems.

      We want to make sure that the expected keys are at least capable of generating all possible
      table positions.
      For example, the first digits of SSNs reflect the region in which they were assigned and hence
      usually would work poorly as a hashing function.
      Even worse, what about the first few digits of a Siena ID number?

Division Here, we let H(key) = key mod TableSize.
      This is very efficient and often gives good results if the TableSize is chosen properly.
      If it is chosen poorly, then you can get very poor results. Consider the case where TableSize
      = 2^8 = 256 and the keys are computed from the ASCII equivalents of two-letter pairs, i.e.,
      Key(xy) = 2^8 · ASCII(x) + ASCII(y); then all pairs ending with the same letter get
      mapped to the same address. Similar problems arise with any power of 2.
      It is usually safest to choose the TableSize to be a prime number in the range of your
      desired table size. The default size in the structure package's hash table is 997.
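A one-line sketch of the division method in Java (the class and method names are illustrative; Math.floorMod is used rather than % so that negative keys still land in range):

```java
// Division hashing: H(key) = key mod TableSize.
// floorMod keeps the result in 0..tableSize-1 even for negative keys.
class DivisionHash {
    static int hash(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }
}
```

With a prime size like 997, consecutive or patterned keys still tend to spread across the table.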

Mid-Square Here, the approach is to square the key and then select a subset of the bits from the
     result. Often, bits from the middle part of the square are taken. The mixing provided by the
     multiplication ensures that all digits are used in the computation of the hash code.
      Example: Let the keys range between 1 and 32000 and let the TableSize be 2048 = 2^11.
      Square the key and select 11 bits from the middle of the result. Note that such bit
      manipulations can be done very efficiently using shifting and bitwise and operations.

      H(Key) = (key >> 10) & 0x07FF;

      will pull out the bits:

      Key              = 123456789abcdefghijklmnopqrstuvw
      Key >> 10        = 0000000000123456789abcdefghijklm
      & 0x07FF         = 000000000000000000000cdefghijklm

      Using r bits produces a table of size 2^r.

Folding Here, we break the key into chunks of digits (sometimes reversing alternating chunks)
     and add them up.
      This is often used if the key is too big. Suppose the keys are Social Security numbers, which
      are 9-digit numbers. We can break each one into three pieces, perhaps the first digit, the
      next 4, and then the last 4, and then add them together.
      This technique is often used in conjunction with other methods – one might do a folding
      step, followed by division.
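The SSN folding above (1 digit + 4 digits + 4 digits) can be sketched as follows; the chunk boundaries follow the example, but the class and method names are made up:

```java
// Folding: split a 9-digit key into its first digit, next four digits,
// and last four digits, then add the chunks together.
class Folding {
    static int fold(int ssn) {
        int lastFour   = ssn % 10000;           // digits 6-9
        int middleFour = (ssn / 10000) % 10000; // digits 2-5
        int firstDigit = ssn / 100000000;       // digit 1
        return firstDigit + middleFour + lastFour;
    }
}
```

The folded value is small enough to feed directly into a division step.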




What if the key is a String?
We can use a formula like Key(xy) = 2^8 · (int)x + (int)y to convert from alphabetic keys to
ASCII equivalents. (Use 2^16 for Unicode.) This is often used in combination with folding (for the
rest of the string) and division.
Here is a very simple-minded hash code for strings: Add together the ordinal equivalent of all
letters and take the remainder mod TableSize.
However, this has a potential problem: words with the same letters get mapped to the same places:

      miles, slime, smile

This would be much improved if you took the letters in pairs before division.
Here is a function which adds up the ASCII codes of letters and then takes the remainder when
divided by TableSize:

hash = 0;
for (int charNo = 0; charNo < word.length(); charNo++)
    hash = hash + (int) word.charAt(charNo);
hash = hash % tableSize; /* gives 0 <= hash < tableSize */

This is not going to be a very good hash function, but it's at least simple.
All Java objects come equipped with a hashCode method (we'll look at this soon). The hashCode
for Java Strings is defined as specified in its Javadoc:

       s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string,
and ^ indicates exponentiation. (The hash value of the empty string is zero.)
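This formula is a polynomial in 31, so it can be evaluated in one pass with Horner's rule; the sketch below (illustrative names) should match what String.hashCode specifies:

```java
// Horner's rule evaluation of s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
// using ordinary (overflowing) int arithmetic as the Javadoc specifies.
class StringHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h; // 0 for the empty string
    }
}
```

Note that unlike the add-the-letters scheme, this is position-sensitive: miles, slime, and smile hash differently.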
See Example:
~jteresco/shared/cs385/examples/HashCodes


Handling Collisions
The home address of a key is the location that the hash function returns for that key.
A hash collision occurs when two different keys have the same home location.
There are two main options for getting out of trouble:

   1. Open addressing, which Levitin most confusingly calls closed hashing: if the desired slot is
      full, do something to find an open slot. This must be done in such a way that one can find
      the element again quickly!

    2. External chaining, which Levitin calls open hashing or separate chaining: make each slot
       capable of holding multiple items, and store all objects with the same home address in
       that slot.

For our discussion, we will assume our hash table consists of an array of TableSize entries of
the appropriate type:

     T [] table = new T[TableSize];



Open Addressing
Here, we ļ¬nd the home address of the key (using the hash function). If it happens to be occupied,
we keep trying new slots until an empty slot is located.
There will be three types of entries in the table:

    • an empty slot, represented by null

    • deleted entries - marked by inserting a special "reserved object" (the need for this will soon
      become obvious), and

    • normal entries which contain a reference to an object in the hash table.

This approach requires rehashing. There are many ways to do this – we will consider a few.
First, linear probing. We simply look for the next available slot in the array beyond the home
address. If we get to the end, we wrap around to the beginning:

Rehash(i) = (i + 1) % TableSize.

This is about as simple as possible. Successive rehashing will eventually try every slot in the table
for an empty slot.
For example, consider a simple hash table of strings, where TableSize = 26, and we choose
a (poor) hash function:
H(key) = the alphabetic position (zero-based) of the ļ¬rst character in the string.
Strings to be input are GA, D, A, G, A2, A1, A3, A4, Z, ZA, E
Here's what happens with linear rehashing:

0     1     2      3     4     5      6     7        8    9   10          ...         25
A     A2    A1     D     A3    A4     GA    G        ZA   E   ..          ...         Z



In this example, we see the potential problem with an open addressing approach in the presence of
collisions: clustering.
Primary clustering occurs when more than one key has the same home address. If the same
rehashing scheme is used for all keys with the same home address, then any new key will collide
with all earlier keys with the same home address when the rehashing scheme is employed.
In the example, this happened with A, A2, A1, A3, and A4.
Secondary clustering occurs when a key has a home address which is occupied by an element
which originally got hashed to a different home address, but in rehashing got moved to the address
which is the same as the home address of the new element.
In the example, this happened with E.
When we search for A3, we need to look in 0,1,2,3,4.
What if we search for AB? When do we know we will not find it? We have to continue our search
to the first empty slot! This is not until position 10!
Now let's delete A2 and then search for A1. If we just put a null in slot 1, we would not find A1
before encountering the null in slot 1.
So we must mark deletions (not just make them empty). These slots can be reused on a new
insertion, but we cannot stop a search when we see a slot marked as deleted. This can be pretty
complicated.
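A minimal linear-probing table of strings with a reserved marker for deletions might look like the sketch below. The class and method names are illustrative (not the structure package's code), and it assumes the table never completely fills:

```java
// Open addressing with linear probing. A unique RESERVED object marks
// deleted slots so searches continue past them, while insertions may
// reuse them.
class ProbingTable {
    private static final String RESERVED = new String("RESERVED");
    private final String[] table;

    ProbingTable(int size) {
        table = new String[size];
    }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    void add(String key) {
        int i = home(key);
        // probe with Rehash(i) = (i + 1) % TableSize; both empty and
        // reserved slots may be (re)used for insertion
        while (table[i] != null && table[i] != RESERVED)
            i = (i + 1) % table.length;
        table[i] = key;
    }

    boolean contains(String key) {
        int i = home(key);
        while (table[i] != null) { // only a truly empty slot ends the search
            if (table[i] != RESERVED && table[i].equals(key))
                return true;
            i = (i + 1) % table.length;
        }
        return false;
    }

    void remove(String key) {
        int i = home(key);
        while (table[i] != null) {
            if (table[i] != RESERVED && table[i].equals(key)) {
                table[i] = RESERVED; // mark deleted; do not null out
                return;
            }
            i = (i + 1) % table.length;
        }
    }
}
```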
Minor variants of linear rehashing would include adding any number k (which is relatively prime
to TableSize), rather than adding 1.
If the number k is divisible by any factor of TableSize (i.e., k is not relatively prime to
TableSize), then not all entries of the table will be explored when rehashing. For instance,
if TableSize = 100 and k = 50, the linear rehashing function will only explore two slots no
matter how many times it is applied to a starting location.
Often the use of k = 1 works as well as any other choice.
Next, we consider quadratic rehashing.
Here, we try slot (home + j^2) % TableSize on the jth rehash.
This variant can help with secondary clustering but not primary clustering.
It can also result in instances where rehashing does not try all possible slots in the table.
For example, suppose the TableSize is 5, with slots 0 to 4. If the home address of an element
is 1, then successive rehashings result in 2, 0, 0, 2, 1, 2, 0, 0, ... The slots 3 and 4 will never be
examined to see if they have room. This is clearly a disadvantage.
Our last option for rehashing is called double hashing.
Rather than computing a uniform jump size for successive rehashes, we make it depend on the key
by using a different hashing function to calculate the rehash.


For example, we might compute

delta(Key) = Key % (TableSize -2) + 1

and add this delta for successive rehash attempts.
If the TableSize is chosen well, this should alleviate both primary and secondary clustering.
For example, suppose the TableSize is 5, and H(n) = n % 5. We calculate delta as above.
So while H(1) = H(6) = H(11) = 1, the jump sizes for the rehash differ since delta(1)
= 2, delta(6) = 1, and delta(11) = 3.
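In code, the key-dependent step is one line (the class name is illustrative):

```java
// Double hashing: a second hash function gives the rehash step size,
// so keys with the same home address usually take different probe paths.
class DoubleHash {
    static int delta(int key, int tableSize) {
        return key % (tableSize - 2) + 1; // always at least 1
    }
}
```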

External Chaining
We can eliminate some of our troubles entirely by allowing each slot in the hash table to hold as
many items as necessary.
The easiest way to do this is to let each slot be the head of a linked list of elements.
The simplest way to represent this is to allocate a table as an array of references, with each
non-null entry referring to a linked list of elements which got mapped to that slot.
We can organize these lists as ordered, singly or doubly linked, circular, etc.
We can even let them be binary search trees if we want.
Of course with good hashing functions, the size of lists should be short enough so that there is no
need for fancy representations (though one may wish to hold them as ordered lists).
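A minimal externally chained table along these lines, using java.util.LinkedList for the buckets (class and method names are illustrative):

```java
import java.util.LinkedList;

// External chaining: each slot heads a linked list of the keys that
// hashed there, so a collision just lengthens one list.
class ChainedTable {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new LinkedList<>();
    }

    private LinkedList<String> bucketFor(String key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void add(String key)         { bucketFor(key).addFirst(key); }
    boolean contains(String key) { return bucketFor(key).contains(key); }
    void remove(String key)      { bucketFor(key).remove(key); }
}
```

Every operation reduces to finding the right bucket and then doing an ordinary list operation on it.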
There are some advantages to this over open addressing:

   1. The "deletion problem" doesn't arise.

   2. The number of elements stored in the table can be larger than the table size.

   3. It avoids all problems of secondary clustering (though primary clustering can still be a
      problem – resulting in long lists in some buckets).



Analysis
We can measure the performance of various hashing schemes with respect to the load factor of the
table.
The load factor, α, of a table is defined as the ratio of the number of elements stored in the table to
the size of the table.
α = 1 means the table is full; α = 0 means it is empty.
Larger values of α lead to more collisions.

(Note that with external chaining, it is possible to have α > 1.)
The table below (from Bailey's Java Structures) summarizes the performance of our collision
resolution techniques in searching for an element.

     Strategy             Unsuccessful              Successful

  Linear probing          (1/2)(1 + 1/(1-α)^2)      (1/2)(1 + 1/(1-α))

  Double hashing          1/(1-α)                   (1/α) ln(1/(1-α))

  External chaining       α + e^(-α)                1 + α/2
The value in each slot represents the average number of compares necessary for a search. The
first column represents the number of compares if the search is ultimately unsuccessful, while the
second represents the case when the item is found.
The main point to note is that the time for linear rehashing goes up dramatically as α approaches 1.
Double hashing is similar, but not so bad, whereas external chaining does not increase very rapidly
at all (linearly).
In particular, if α = .9, we get:

     Strategy             Unsuccessful   Successful

  Linear probing          55             5.5
  Double hashing          10             ~4
  External chaining       3              1.45
The differences become greater with heavier loading.
The space requirements (in words of memory) are roughly the same for both techniques. If we
have n items stored in a table allocated with TableSize entries:

   • open addressing: TableSize + n · objectsize

   • external chaining: TableSize + n · (objectsize + 1)

With external chaining we can get away with a smaller TableSize, since it responds better to
the resulting higher load factor.
A general rule of thumb might be to use open addressing when we have a small number of elements
and expect a low load factor. When we have a larger number of elements, external chaining can be
more appropriate.


Hashtable Implementations
Hashing is a central enough idea that all Objects in Java come equipped with a method hashCode
that will compute a hash function. Like equals, this is defined by Object. We can override it
if we know how we want to have our specific types of objects behave.
Java includes hash tables in java.util.Hashtable.

It uses external chaining, with the buckets implemented as a linear structure. It allows two
parameters to be set through arguments to the constructor:

   • initialCapacity - how big is the array of buckets

   • loadFactor - how full can the table get before it automatically grows

Suppose an insertion causes the hash table to exceed its load factor. What does the resize operation
entail? Every item in the hash table must be rehashed! If only a small number of buckets get used,
this could be a problem. We could be resizing when most buckets are still empty. Moreover, the
resize and rehash may or may not help. In fact, shrinking the table might be just as effective as
growing it in some circumstances.
Next, we will consider the hash table implementations in the structure package.

The Hashtable Class
Interesting features:

   • Implementation of open addressing with default size 997.

   • The load factor threshold for a rehashing is fixed at 0.6. Exceeding this threshold will result
     in the table being doubled in size and all entries being rehashed.

   • There is a special RESERVED entry to make sure we can still locate values after we have
     removed items that caused clustering.

   • The protected locate method finds the entry where a given key is stored in the table or
     an empty slot where it should be stored if it is to be added. Note that we need to search
     beyond the first RESERVED location because we're not sure yet if we might still find the
     actual value. But we do want to use the first reserved value we come across to minimize
     search distance for this key later.

   • Now get and put become simple – most of the work is done in locate.


The ChainedHashtable Class
Interesting features:

   • Implementation of external chaining, default size 997.

   • We create an array of buckets that holds Lists of Associations. SinglyLinkedLists
     are used.




  • Lists are allocated when we try to locate the bucket for a key, even if we're not inserting.
    This allows other operations to degenerate into list operations.
    For example, to check if a key is in the table, we locate its bucket and use the list's contains
    method.

  • We add an entry with the put method.

         1. we locate the correct bucket
         2. we create a new Association to hold the new item
         3. we remove the item (!) from the bucket
         4. we add the item to the bucket (why?)
         5. if we actually removed a value we return the old one
         6. otherwise, we increase the count and return null

  • The get method is also in a sense destructive. Rather than just looking up the value, we
    remove and reinsert the key if it's found.

  • This reinsertion for put and get means that we will move the values we are using to the front
    of their bucket. Helpful?




Ā 
Unit 4 memory system
Unit 4   memory systemUnit 4   memory system
Unit 4 memory systemchidabdu
Ā 

More from chidabdu (19)

Sienna 4 divideandconquer
Sienna 4 divideandconquerSienna 4 divideandconquer
Sienna 4 divideandconquer
Ā 
Sienna 3 bruteforce
Sienna 3 bruteforceSienna 3 bruteforce
Sienna 3 bruteforce
Ā 
Sienna 2 analysis
Sienna 2 analysisSienna 2 analysis
Sienna 2 analysis
Ā 
Sienna 1 intro
Sienna 1 introSienna 1 intro
Sienna 1 intro
Ā 
Sienna 13 limitations
Sienna 13 limitationsSienna 13 limitations
Sienna 13 limitations
Ā 
Unit 3 basic processing unit
Unit 3   basic processing unitUnit 3   basic processing unit
Unit 3 basic processing unit
Ā 
Unit 5 I/O organization
Unit 5   I/O organizationUnit 5   I/O organization
Unit 5 I/O organization
Ā 
Algorithm chapter 1
Algorithm chapter 1Algorithm chapter 1
Algorithm chapter 1
Ā 
Algorithm chapter 11
Algorithm chapter 11Algorithm chapter 11
Algorithm chapter 11
Ā 
Algorithm chapter 10
Algorithm chapter 10Algorithm chapter 10
Algorithm chapter 10
Ā 
Algorithm chapter 9
Algorithm chapter 9Algorithm chapter 9
Algorithm chapter 9
Ā 
Algorithm chapter 8
Algorithm chapter 8Algorithm chapter 8
Algorithm chapter 8
Ā 
Algorithm chapter 6
Algorithm chapter 6Algorithm chapter 6
Algorithm chapter 6
Ā 
Algorithm chapter 5
Algorithm chapter 5Algorithm chapter 5
Algorithm chapter 5
Ā 
Algorithm chapter 2
Algorithm chapter 2Algorithm chapter 2
Algorithm chapter 2
Ā 
Algorithm review
Algorithm reviewAlgorithm review
Algorithm review
Ā 
Unit 2 arithmetics
Unit 2   arithmeticsUnit 2   arithmetics
Unit 2 arithmetics
Ā 
Unit 1 basic structure of computers
Unit 1   basic structure of computersUnit 1   basic structure of computers
Unit 1 basic structure of computers
Ā 
Unit 4 memory system
Unit 4   memory systemUnit 4   memory system
Unit 4 memory system
Ā 

Recently uploaded

Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
Ā 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
Ā 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
Ā 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
Ā 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
Ā 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
Ā 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
Ā 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
Ā 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
Ā 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
Ā 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
Ā 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
Ā 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
Ā 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
Ā 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
Ā 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
Ā 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
Ā 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
Ā 

Recently uploaded (20)

Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
Ā 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
Ā 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
Ā 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
Ā 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
Ā 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
Ā 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Ā 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
Ā 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
Ā 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
Ā 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
Ā 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Ā 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Ā 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
Ā 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
Ā 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
Ā 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
Ā 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Ā 

Sienna 9 hashing

What if we're trying to store strings or doubles or something more complex? Moreover, the size requirements for this implementation could be prohibitive.
Suppose we wanted to store 2000 student records indexed by social security number or a Siena ID number. We would need an array with 1 billion elements! In cases like this, where most of the entries will be empty, we would like to use a smaller array, but come up with a good way to map our elements into the array entries.
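The direct, array-indexed-by-key idea from the uniform-number example can be sketched as follows. This is a minimal illustration, not code from the notes; the roster size and player names are hypothetical.

```java
public class DirectAddressTable {
    // Uniform numbers are assumed to fall in 0..99 (hypothetical range).
    static String[] roster = new String[100];

    // Constant-time operations: the key itself is the array index.
    static void add(int uniformNumber, String name) { roster[uniformNumber] = name; }
    static String lookup(int uniformNumber)         { return roster[uniformNumber]; }

    public static void main(String[] args) {
        add(23, "Jordan");
        add(34, "Barkley");
        System.out.println(lookup(23)); // prints Jordan
    }
}
```

Unused numbers simply leave null slots; the cost of the O(1) access is that the array must span the whole key range, which is exactly the problem hashing addresses.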
Suppose we have a lot of data elements of type E and a set of indices in which we can store data elements. We would like to obtain a function H: E → index, which has the properties:

 1. H(elt) can be computed quickly
 2. If elt1 ≠ elt2 then H(elt1) ≠ H(elt2) (i.e., H is a one-to-one function)

This is called a perfect hashing function. Unfortunately, perfect hashing functions are difficult to find unless you know all possible entries to the table in advance. This is rarely the case.
Instead we use something that behaves well, but not necessarily perfectly. The goal is to scatter elements through the array randomly so that they won't attempt to be stored in the same location, or "bump into" each other.
So we will define a hash function H: Keys → Addresses (or array indices), and we will call H(element.key) the home address of element.
Assuming no two distinct elements wind up mapping to the same location, we can now add, search for, and remove them in O(1) time. The one thing we can no longer do efficiently that we could with some of the other maps is list the elements in order.
Since the function is not guaranteed to be one-to-one, we cannot guarantee that no two distinct elements map to the same home address. This suggests two problems to consider:

 1. How do we choose a good hashing function?
 2. How do we resolve the situation where two different elements are mapped to the same home address?


Selecting Hashing Functions

The following quote should be memorized by anyone trying to design a hashing function:

 "A given hash function must always be tried on real data in order to find out whether it is effective or not."

Data which contains unexpected and unfortunate regularities can completely destroy the usefulness of any hashing function. We will start by considering hashing functions for integer-valued keys.

Digit selection

As the name suggests, this involves choosing digits from certain positions of the key (e.g., the last 3 digits of an SSN). Unfortunately, it is easy to make a very poor choice. We would need careful analysis of the expected keys to see which digits will work best; likely patterns can cause problems.
We want to make sure that the expected keys are at least capable of generating all possible table positions. For example, the first digits of SSNs reflect the region in which they were assigned, and hence usually would work poorly as a hashing function. Even worse, what about the first few digits of a Siena ID number?

Division

Here, we let

 H(key) = key mod TableSize

This is very efficient and often gives good results if the TableSize is chosen properly. If it is chosen poorly, you can get very poor results. Consider the case where TableSize = 2^8 = 256 and the keys are computed from the ASCII equivalents of two-letter pairs, i.e., Key(xy) = 2^8 · ASCII(x) + ASCII(y). Then all pairs ending with the same letter get mapped to the same address. Similar problems arise with any power of 2. It is usually safest to choose the TableSize to be a prime number in the range of your desired table size. The default size in the structure package's hash table is 997.

Mid-Square

Here, the approach is to square the key and then select a subset of the bits from the result. Often, bits from the middle part of the square are taken. The mixing provided by the multiplication ensures that all digits are used in the computation of the hash code.

Example: Let the keys range between 1 and 32000 and let the TableSize be 2048 = 2^11. Square the key and select 11 bits from the middle of the result. Note that such bit manipulations can be done very efficiently using shifting and bitwise and operations:

 H(Key) = (key >> 10) & 0x07FF;

will pull out the bits:

 Key       = 123456789abcdefghijklmnopqrstuvw
 Key >> 10 = 0000000000123456789abcdefghijklm
 & 0x07FF  = 000000000000000000000cdefghijklm

Using r bits produces a table of size 2^r.

Folding

Here, we break the key into chunks of digits (sometimes reversing alternating chunks) and add them up. This is often used if the key is too big. Suppose the keys are Social Security numbers, which are 9-digit numbers. We can break each one into three pieces, perhaps the first digit, the next 4, and then the last 4, and add them together. This technique is often used in conjunction with other methods: one might do a folding step, followed by division.
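The three integer-key schemes above can be sketched together. This is a minimal illustration, not the structure package's code; the table sizes (997 and 2048 = 2^11) and the 1 + 4 + 4 digit split are taken from the examples above.

```java
public class IntegerHashes {
    // Division: key mod a prime table size (997, as in the structure package).
    static int divisionHash(int key) {
        return key % 997;
    }

    // Mid-square: square the key, then shift and mask to pull 11 bits
    // from the middle of the product, for a table of size 2048 = 2^11.
    static int midSquareHash(int key) {
        int squared = key * key;
        return (squared >> 10) & 0x07FF;   // 0x07FF keeps the low 11 bits
    }

    // Folding: split a 9-digit SSN into 1 + 4 + 4 digit chunks and sum them.
    static int fold(int ssn) {
        int first  = ssn / 100000000;       // leading digit
        int middle = (ssn / 10000) % 10000; // next four digits
        int last   = ssn % 10000;           // final four digits
        return first + middle + last;
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(123456789));       // a slot in 0..996
        System.out.println(midSquareHash(31999));          // a slot in 0..2047
        System.out.println(fold(123456789));               // 1 + 2345 + 6789 = 9135
        System.out.println(divisionHash(fold(123456789))); // folding, then division
    }
}
```

The last line shows the combination mentioned above: a folded value is small enough to feed into the division step.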
What if the key is a String? We can use a formula like

 Key(xy) = 2^8 · (int)x + (int)y

to convert from alphabetic keys to ASCII equivalents. (Use 2^16 for Unicode.) This is often used in combination with folding (for the rest of the string) and division.
Here is a very simple-minded hash code for strings: add together the ordinal equivalents of all letters and take the remainder mod TableSize. However, this has a potential problem: words with the same letters get mapped to the same place:

 miles, slime, smile

This would be much improved if you took the letters in pairs before division.
Here is a function which adds up the ASCII codes of letters and then takes the remainder when divided by TableSize:

 hash = 0;
 for (int charNo = 0; charNo < word.length(); charNo++)
     hash = hash + (int)(word.charAt(charNo));
 hash = hash % tableSize; /* gives 0 <= hash < tableSize */

This is not going to be a very good hash function, but it's at least simple.
All Java objects come equipped with a hashCode method (we'll look at this soon). The hashCode for Java Strings is defined as specified in its Javadoc:

 s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
See Example: ~jteresco/shared/cs385/examples/HashCodes

Handling Collisions

The home address of a key is the location that the hash function returns for that key. A hash collision occurs when two different keys have the same home address. There are two main options for getting out of trouble:

 1. Open addressing, which Levitin most confusingly calls closed hashing: if the desired slot is full, do something to find an open slot. This must be done in such a way that one can find the element again quickly!
 2. External chaining, which Levitin calls open hashing or separate chaining: make each slot capable of holding multiple items, and store all objects with the same home address in their desired slot.

For our discussion, we will assume our hash table consists of an array of TableSize entries of the appropriate type:

 T[] table = new T[TableSize];

Open Addressing

Here, we find the home address of the key (using the hash function). If it happens to be occupied, we keep trying new slots until an empty slot is located. There will be three types of entries in the table:

 • an empty slot, represented by null
 • deleted entries, marked by inserting a special "reserved object" (the need for this will soon become obvious)
 • normal entries, which contain a reference to an object in the hash table

This approach requires rehashing. There are many ways to do this; we will consider a few.
First, linear probing. We simply look for the next available slot in the array beyond the home address. If we get to the end, we wrap around to the beginning:

 Rehash(i) = (i + 1) % TableSize

This is about as simple as possible. Successive rehashing will eventually try every slot in the table for an empty slot.
For example, consider a simple hash table of strings, where TableSize = 26, and we choose a (poor) hash function: H(key) = the alphabetic position (zero-based) of the first character in the string. The strings to be input are

 GA, D, A, G, A2, A1, A3, A4, Z, ZA, E

Here's what happens with linear rehashing:

 slot:  0   1   2   3   4   5   6   7   8   9   10  ...  25
 item:  A   A2  A1  D   A3  A4  GA  G   ZA  E   ..  ...  Z
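The insertion side of this example can be sketched in a few lines; this reproduces the table above using the same (poor) first-letter hash function and linear rehashing, and assumes the table never fills completely.

```java
public class LinearProbing {
    static String[] table = new String[26];

    // Hash: zero-based alphabet position of the first character.
    static int home(String key) { return key.charAt(0) - 'A'; }

    // Linear probing: step forward one slot at a time, wrapping around.
    // (Assumes at least one slot is free, or this would loop forever.)
    static void put(String key) {
        int slot = home(key);
        while (table[slot] != null)
            slot = (slot + 1) % table.length;
        table[slot] = key;
    }

    public static void main(String[] args) {
        String[] input = {"GA", "D", "A", "G", "A2", "A1", "A3", "A4", "Z", "ZA", "E"};
        for (String s : input) put(s);
        // Reproduces the layout from the notes:
        // 0:A 1:A2 2:A1 3:D 4:A3 5:A4 6:GA 7:G 8:ZA 9:E ... 25:Z
        for (int i = 0; i < table.length; i++)
            if (table[i] != null) System.out.println(i + ": " + table[i]);
    }
}
```

Tracing ZA makes the wrap-around visible: its home slot 25 is taken by Z, so it wraps to 0 and probes forward until slot 8 is free.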
In this example, we see the potential problem with an open addressing approach in the presence of collisions: clustering.
Primary clustering occurs when more than one key has the same home address. If the same rehashing scheme is used for all keys with the same home address, then any new key will collide with all earlier keys with the same home address when the rehashing scheme is employed. In the example, this happened with A, A2, A1, A3, and A4.
Secondary clustering occurs when a key has a home address which is occupied by an element which originally got hashed to a different home address, but in rehashing got moved to the address which is the same as the home address of the new element. In the example, this happened with E.
When we search for A3, we need to look in slots 0, 1, 2, 3, and 4. What if we search for AB? When do we know we will not find it? We have to continue our search to the first empty slot! This is not until position 10!
Now let's delete A2 and then search for A1. If we just put a null in slot 1, we would not find A1 before encountering the null in slot 1. So we must mark deletions (not just make them empty). These slots can be reused on a new insertion, but we cannot stop a search when we see a slot marked as deleted. This can be pretty complicated.
Minor variants of linear rehashing would include adding any number k which is relatively prime to TableSize, rather than adding 1. If the number k is divisible by any factor of TableSize (i.e., k is not relatively prime to TableSize), then not all entries of the table will be explored when rehashing. For instance, if TableSize = 100 and k = 50, the linear rehashing function will only explore two slots no matter how many times it is applied to a starting location. Often the use of k = 1 works as well as any other choice.
Next, we consider quadratic rehashing. Here, we try slot (home + j^2) % TableSize on the jth rehash. This variant can help with secondary clustering but not primary clustering. It can also result in instances where rehashing does not try all possible slots in the table. For example, suppose the TableSize is 5, with slots 0 to 4. If the home address of an element is 1, then successive rehashings result in

 2, 0, 0, 2, 1, 2, 0, 0, ...

The slots 3 and 4 will never be examined to see if they have room. This is clearly a disadvantage.
Our last option for rehashing is called double hashing. Rather than computing a uniform jump size for successive rehashes, we make it depend on the key by using a different hashing function to calculate the rehash.
For example, we might compute

 delta(Key) = Key % (TableSize - 2) + 1

and add this delta for successive rehash attempts. If the TableSize is chosen well, this should alleviate both primary and secondary clustering.
For example, suppose the TableSize is 5, and H(n) = n % 5. We calculate delta as above. While H(1) = H(6) = H(11) = 1, the jump sizes for the rehash differ, since delta(1) = 2, delta(6) = 1, and delta(11) = 3.

External Chaining

We can eliminate some of our troubles entirely by allowing each slot in the hash table to hold as many items as necessary. The easiest way to do this is to let each slot be the head of a linked list of elements.
The simplest way to represent this is to allocate the table as an array of references, with each non-null entry referring to a linked list of the elements which got mapped to that slot. We can organize these lists as ordered, singly or doubly linked, circular, etc. We can even let them be binary search trees if we want. Of course, with good hashing functions, the lists should be short enough that there is no need for fancy representations (though one may wish to hold them as ordered lists).
There are some advantages to this over open addressing:

 1. The "deletion problem" doesn't arise.
 2. The number of elements stored in the table can be larger than the table size.
 3. It avoids all problems of secondary clustering (though primary clustering can still be a problem, resulting in long lists in some buckets).
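The two rehashing variants above can be compared directly. This sketch only generates probe sequences, it does not build a table; the TableSize of 5 and the keys 1, 6, and 11 come from the examples above.

```java
import java.util.HashSet;
import java.util.Set;

public class RehashDemo {
    static final int TABLE_SIZE = 5;

    // Quadratic rehashing: probe j tries slot (home + j*j) % TableSize.
    // Collect every distinct slot the first `probes` rehashes can reach.
    static Set<Integer> quadraticReachable(int home, int probes) {
        Set<Integer> slots = new HashSet<>();
        slots.add(home);
        for (int j = 1; j <= probes; j++)
            slots.add((home + j * j) % TABLE_SIZE);
        return slots;
    }

    // Double hashing: the jump size between rehashes depends on the key.
    static int delta(int key) {
        return key % (TABLE_SIZE - 2) + 1;
    }

    public static void main(String[] args) {
        // From home address 1, quadratic probing reaches only slots {0, 1, 2}:
        // slots 3 and 4 are never examined, no matter how many probes we make.
        System.out.println(quadraticReachable(1, 1000));

        // Keys 1, 6, and 11 share home address 1 but get jump sizes 2, 1, 3,
        // so their probe sequences diverge immediately.
        for (int key : new int[]{1, 6, 11})
            System.out.println("delta(" + key + ") = " + delta(key));
    }
}
```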
Analysis

We can measure the performance of various hashing schemes with respect to the load factor of the table. The load factor, α, of a table is defined as the ratio of the number of elements stored in the table to the size of the table. α = 1 means the table is full; α = 0 means it is empty. Larger values of α lead to more collisions. (Note that with external chaining, it is possible to have α > 1.)
The table below (from Bailey's Java Structures) summarizes the performance of our collision resolution techniques in searching for an element:

 Strategy            Unsuccessful              Successful
 Linear probing      (1/2)(1 + 1/(1-α)^2)      (1/2)(1 + 1/(1-α))
 Double hashing      1/(1-α)                   (1/α) ln(1/(1-α))
 External chaining   α + e^(-α)                1 + α/2

The value in each slot represents the average number of compares necessary for a search. The first column represents the number of compares if the search is ultimately unsuccessful, while the second represents the case when the item is found.
The main point to note is that the time for linear rehashing goes up dramatically as α approaches 1. Double hashing is similar, but not so bad, whereas external chaining does not increase very rapidly at all (linearly). In particular, if α = .9, we get:

 Strategy            Unsuccessful   Successful
 Linear probing      55             11/2
 Double hashing      10             ~4
 External chaining   3              1.45

The differences become greater with heavier loading.
The space requirements (in words of memory) are roughly the same for both techniques. If we have n items stored in a table allocated with TableSize entries:

 • open addressing: TableSize + n · objectsize
 • external chaining: TableSize + n · (objectsize + 1)

With external chaining we can get away with a smaller TableSize, since it responds better to the resulting higher load factor. A general rule of thumb might be to use open addressing when we have a small number of elements and expect a low load factor. When we have a larger number of elements, external chaining can be more appropriate.

Hashtable Implementations

Hashing is a central enough idea that all Objects in Java come equipped with a method hashCode that will compute a hash function. Like equals, this is required by Object. We can override it if we know how we want to have our specific types of objects behave.
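A minimal sketch of such an override. The UniformNumber class is hypothetical, invented here to illustrate the pairing: a class that overrides equals must override hashCode so that equal objects report equal hash codes.

```java
public class UniformNumber {
    private final int number;

    public UniformNumber(int number) { this.number = number; }

    // Two UniformNumbers are equal exactly when their numbers match...
    @Override
    public boolean equals(Object other) {
        return other instanceof UniformNumber
            && ((UniformNumber) other).number == this.number;
    }

    // ...so equal objects must produce the same hashCode.
    @Override
    public int hashCode() { return number; }

    public static void main(String[] args) {
        UniformNumber a = new UniformNumber(23);
        UniformNumber b = new UniformNumber(23);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```

Without the hashCode override, a and b would usually land in different buckets of a hash table even though they are equal.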
Java includes hash tables in java.util.Hashtable.
It uses external chaining, with the buckets implemented as a linear structure. It allows two parameters to be set through arguments to the constructor:

 • initialCapacity: how big is the array of buckets
 • loadFactor: how full can the table get before it automatically grows

Suppose an insertion causes the hash table to exceed its load factor. What does the resize operation entail? Every item in the hash table must be rehashed! If only a small number of buckets get used, this could be a problem: we could be resizing when most buckets are still empty. Moreover, the resize and rehash may or may not help. In fact, shrinking the table might be just as effective as growing it in some circumstances.
Next, we will consider the hash table implementations in the structure package.

The Hashtable Class

Interesting features:

 • Implementation of open addressing, with default size 997.
 • The load factor threshold for a rehashing is fixed at 0.6. Exceeding this threshold will result in the table being doubled in size and all entries being rehashed.
 • There is a special RESERVED entry to make sure we can still locate values after we have removed items that caused clustering.
 • The protected locate method finds the entry where a given key is stored in the table, or an empty slot where it should be stored if it is to be added. Note that we need to search beyond the first RESERVED location, because we are not sure yet whether we might still find the actual value. But we do want to use the first reserved slot we come across, to minimize the search distance for this key later.
 • Now get and put become simple; most of the work is done in locate.

The ChainedHashtable Class

Interesting features:

 • Implementation of external chaining, default size 997.
 • We create an array of buckets that holds Lists of Associations. SinglyLinkedLists are used.
 • Lists are allocated when we try to locate the bucket for a key, even if we're not inserting. This allows other operations to degenerate into list operations. For example, to check if a key is in the table, we locate its bucket and use the list's contains method.
 • We add an entry with the put method:
    1. we locate the correct bucket
    2. we create a new Association to hold the new item
    3. we remove the item (!) from the bucket
    4. we add the item to the bucket (why?)
    5. if we actually removed a value, we return the old one
    6. otherwise, we increase the count and return null
 • The get method is also in a sense destructive. Rather than just looking up the value, we remove and reinsert the key if it's found.
 • This reinsertion for put and get means that we will move the values we are using to the front of their bucket. Helpful?
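The put and get behavior described above can be sketched with plain java.util lists. This is a simplified stand-in for the structure package's ChainedHashtable and Association classes, not their actual code; two-element String arrays play the role of Associations.

```java
import java.util.LinkedList;

public class MiniChainedTable {
    static final int TABLE_SIZE = 997;

    // Each bucket is a list of {key, value} pairs.
    @SuppressWarnings("unchecked")
    static LinkedList<String[]>[] buckets = new LinkedList[TABLE_SIZE];

    // The bucket's list is allocated even on lookups, so every operation
    // can degenerate into a list operation.
    static LinkedList<String[]> bucketFor(String key) {
        int slot = Math.abs(key.hashCode() % TABLE_SIZE);
        if (buckets[slot] == null)
            buckets[slot] = new LinkedList<>();
        return buckets[slot];
    }

    // put: remove any existing pair for this key, then add the new pair at
    // the front of the bucket, returning the replaced value (or null).
    static String put(String key, String value) {
        LinkedList<String[]> bucket = bucketFor(key);
        String old = null;
        for (String[] pair : bucket)
            if (pair[0].equals(key)) { old = pair[1]; bucket.remove(pair); break; }
        bucket.addFirst(new String[]{key, value});
        return old;
    }

    // get: find the value, then reinsert it so the key moves to the front.
    static String get(String key) {
        for (String[] pair : bucketFor(key))
            if (pair[0].equals(key)) { put(key, pair[1]); return pair[1]; }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(put("alpha", "1"));   // null: no previous value
        System.out.println(put("alpha", "one")); // 1: the replaced value
        System.out.println(get("alpha"));        // one
    }
}
```

The remove-then-addFirst pattern in put is what makes recently used keys drift toward the head of their bucket, which is one plausible answer to the "Helpful?" question above when lookups are skewed toward a few hot keys.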