6.006 5. Hashing I: Chaining, Hash functions
MIT OpenCourseWare http://ocw.mit.edu6.006Introduction to AlgorithmsSpring 2008For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008Lecture 5: Hashing I: Chaining, Hash Functions Lecture Overview • Dictionaries and PythonMotivation• Hash functions • • Chaining • Simple uniform hashing“Good” hash functions• Readings CLRS Chapter 11. 1, 11. 2, 11. 3. Dictionary Problem Abstract Data Type (ADT) maintains a set of items, each with a key, subject to • insert(item): add item to set • delete(item): remove item from set • search(key): return item with key if it exists • assume items have distinct keys (or that inserting new one clobbers old) • balanced BSTs solve in O(lg n) time per op. (in addition to inexact searches like nextlargest). • goal: O(1) time per operation. Python Dictionaries: Items are (key, value) pairs e.g. d = ‘algorithms’: 5, ‘cool’: 42 d.items() [(‘algorithms’, 5),(‘cool’,5)] →d[‘cool’] 42→d[42] KeyError→‘cool’ in d True →42 in d False → Python set is really dict where items are keys. 1 Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008Motivation Document Distance • already used in def count_frequency(word_list):D = {}for word in word_list:if word in D:D[word] += 1else:D[word] = 1• new docdist7 uses dictionaries instead of sorting: def inner_product(D1, D2):sum = φ. φfor key in D1:if key in D2:sum += D1[key]*D2[key]= optimal Θ(n) document distance assuming dictionary ops. take O(1) time ⇒ PS2 How close is chimp DNA to human DNA? = Longest common substring of two strings e.g. ALGORITHM vs. ARITHMETIC. Dictionaries help speed algorithms e.g. put all substrings into set, looking for duplicates -Θ(n2) operations. 2Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008How do we solve the dictionary problem? A simple approach would be a direct access table. This means items would need to be stored in an array, indexed by key. 12keykeykeyitemitemitem...Figure 1: Direct-access table Problems: 1. keys must be nonnegative integers (or using two arrays, integers) 2. large key range = large space e.g. one key of 2256 is bad news. ⇒ 2 Solutions: Solution 1 : map key space to integers. • In Python: hash (object) where object is a number, string, tuple, etc. or object implementing — hash — Misnomer: should be called “prehash” Ideally, x = y hash(x) = hash (y)• ⇔ • Python applies some heuristics e.g. hash(‘\φB ’) = 64 = hash(‘\φ \ φC’) • Object’s key should not change while in table (else cannot find it anymore) • No mutable objects like lists 3 Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008Solution 2 : hashing (verb from ‘hache’ = hatchet, Germanic) • Reduce universe U of all keys (say, integers) down to reasonable size m for table • idea: m ≈ n, n =| k |, k = keys in dictionary • hash function h: U → φ, 1,...,m − 1 1m-1k23kk1Th(k1) = 1..............Ukkkkk1234Figure 2: Mapping keys to a table • two keys ki,kj �K collide if h(ki)= h(kj ) How do we deal with collisions? There are two ways 1. Chaining: TODAY 2. Open addressing: NEXT LECTURE 4Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008Chaining Linked list of colliding elements in each slot of table 1....Ukkkkk1234k...4k.k2k3h(k1) =h(k2) =h(k4) Figure 3: Chaining in a Hash Table • Search must go through whole list T[h(key)] Worst case: all keys in k hash to same slot = Θ(n) per operation •⇒ Simple Uniform Hashing -an Assumption: Each key is equally likely to be hashed to any slot of table, independent of where other keys are hashed. let n = � keys stored in table m = � slots in table load factor α = n/m = average � keys per slot Expected performance of chaining: assuming simple uniform hashing The performance is likely to be O(1 + α) -the 1 comes from applying the hash function and access slot whereas the α comes from searching the list. It is actually Θ(1 + α), even for successful search (see CLRS ). Therefore, the performance is O(1) if α = O(1) i. e. m = Ω(n). 5Lecture 5 Hashing I: Chaining, Hash Functions 6.006 Spring 2008HashFunctions Division Method: h(k)= k mod m • k1 and k2 collide when k1 = k2( mod m) i. e. when m divides | k1 − k2 | • fine if keys you store are uniform random • but if keys are x, 2x, 3x, . . . (regularity) and x and m have common divisor d then use only 1/d of table. This is likely if m has a small divisor e. g. 2. • if m =2r then only look at r bits of key! Good Practice: A good practice to avoid common regularities in keys is to make m aprime number that is not close to power of 2 or 10.Key Lesson: It is inconvenient to find a prime number; division slow.Multiplication Method: h(k) = [(ak) mod 2w] � (w − r) where m =2r and w-bit machine words and a = odd · integer between 2(w − 1) and 2w .Good Practise: a not too close to 2(w−1) or 2w .Key Lesson: Multiplication and bit extraction are faster than division.wkaxr}Figure 4: Multiplication Method 6
Description
Upon completion of this lesson, you should be able to better under stand How to solve the Dictionary Problem with the help of Python and the basic logic behind Hash functions,What is Chaining in Hashing concept? Example of Simple uniform hashing and differnt “Good” hash functions.
Instructors: Prof. Erik Demaine, Prof. Ronald Rivest, Prof. Srinivas Devadas, MIT Course Number: 6.006 Level: Undergraduate, 6.006 5. Hashing I: chaining, hash functions, Introduction to Algorithms, Electrical Engineering and Computer Science, Engineering, Massachusetts Institute of Technology: MIT Open Course Ware, http://ocw.mit.edu (02-11-2011). License: Creative Commons BY-NC-SA: http://ocw.mit.edu/terms/#cc.
Presentation Transcript
Your Facebook Friends on WizIQ