
Talk:Hash table

From Wikipedia, the free encyclopedia

This is the talk page for discussing improvements to the Hash table article.
This is not a forum for general discussion about the article's subject.


Who is Bob Jenkins?

I'm glad he decided to contribute so much, but honestly it seems like he is just plugging his website. Unless his work has been published in a peer-reviewed journal, I don't think it should be continuously cited throughout this article. A simple mention of his site in the external links section would suffice. --Kibblesnbits 17:16, 13 March 2007 (UTC)

Pseudo-Code

I have a few small issues with the current Hash table article. First, the pseudo-code is a little convoluted, in that the findSlot(..) function is not 100% obvious. It hides away the linear probe code, which makes reading the set(..) and lookup(..) functions a little confusing on first read. I was just going over it with an intro computer science student and even had to read it twice myself. Josh 08:11, 10 April 2006 (UTC)

OK, I tweaked find_slot() to make it more obvious (to me) that it is doing a linear probe.
Or did I just make it more confusing?
Please feel free to improve it further.
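For readers following along, here is a rough C rendering of the kind of linear-probe find_slot being discussed; the slot layout and names are hypothetical (not the article's actual pseudocode), and it assumes the table is never completely full:

#include <string.h>
#include <stddef.h>

/* hypothetical slot: a NULL key marks a "not occupied" slot */
struct slot { const char *key; int value; };

/* Linear probe: start at the key's natural position and step one slot
   at a time (wrapping around) until we find the key or an empty slot. */
size_t find_slot(struct slot *table, size_t size,
                 size_t (*hash)(const char *), const char *key)
{
    size_t i = hash(key) % size;
    while (table[i].key != NULL && strcmp(table[i].key, key) != 0)
        i = (i + 1) % size;   /* the linear probe step */
    return i;   /* slot containing key, or the empty slot where it would go */
}

set(..) stores into whatever slot this returns; lookup(..) checks whether the returned slot is occupied.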

O(1) Removal

There is a line which states:

The O(1) remove method above is only possible in linearly probed hash tables with single-slot stepping. In the case where many records are to be deleted in one operation, marking the slots for deletion and later rebuilding may be more efficient.

I'm not really sure that this is accurate. I might agree with calling it expected O(1), or average O(1). But it is not, as far as I can tell, worst-case O(1). In fact, remove is worst-case O(n), as in all hash operations. Josh 08:11, 10 April 2006 (UTC)

In my experience the O() notation usually refers to expected runtime, unless otherwise stated, but feel free to disambiguate it. WolfKeeper 14:07, 10 April 2006 (UTC)

I know that there has been some discussion of amortized runtime here, but I'm not really sure that you can use amortized analysis on this function. Please correct me if I am wrong; otherwise, I will change it shortly. Further, I am not sure why the article says "only possible in linearly probed hash tables"... If the remove function is O(1) with a linear probe, then why not with a quadratic probe? Josh 08:11, 10 April 2006 (UTC)

You need the delete to remove gaps. The quadratic probe moves a different amount through the hash table depending on the initial number from the hash function, rather than on the slot where the initial collision occurred. So two hash entries may have collided initially at, say, slot 10, and one of them got put into slot 20. If slot 10 is deleted, with quadratic probing there's no way to compact the entry in slot 20 down into slot 10, because there's no way to find it: the hash of the object at 10 doesn't tell you how to find any of the successors. With linear probing you just need to look in the next entry and then decide whether to move it or not. WolfKeeper 14:07, 10 April 2006 (UTC)
I think that this discussion reveals a real need to disambiguate expected vs. worst-case runtime. I think that we'll both agree that remove(..) in a hash table requires worst-case O(n) steps, even with a linear probe. If you don't, then we need to discuss that first.
If you don't see why the worst-case isn't particularly relevant, then I don't want to discuss this further. WolfKeeper 16:31, 10 April 2006 (UTC)
I found a very good explanation in CLRS03, Chapter 11. Basically, the text proves that, on average, 1/(1 - α) probes are needed in an unsuccessful search, where α is the load factor. Using this formula, we can see that even at α = 0.9, the number of probes required is only 10. I would still argue that worst case is relevant, but this clearly does not apply until the hash table is very full. The magic number of 80% is explained in the article, but perhaps this would be better understood with a graph illustrating how the performance changes with load? --Josh 01:24, 11 April 2006 (UTC)
Now, let's go back to what you're saying. Yes, you are correct that in a linear probe you only need to look at the next slot, although I would nitpick and say that you need to look at the next n slots if the hash table is full. However, in a quadratic probe, instead of saying slot = slot + 1, you say slot = hash(key) + c1*i + c2*i^2. We assume that hash(key) was cached and does not need to be recomputed. Therefore, as long as we know i, we can clearly find the next slot in constant time. The same applies for double hashing. You just say slot = hash(hash(key)). Since we already know hash(key), then hash(hash(key)) takes O(k), as per previous discussion.
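(For concreteness, a minimal C sketch of the probe-step computations under discussion; c1 and c2 are arbitrary placeholder constants, and the double-hashing line uses the textbook h1 + i*h2 form, with h2 a second hash of the key, rather than the hash(hash(key)) shorthand above.)

#include <stddef.h>

/* i-th slot to inspect, given the key's cached base hash h and table size m */
enum { c1 = 1, c2 = 3 };   /* placeholder probe constants */

size_t linear_slot(size_t h, size_t i, size_t m)    { return (h + i) % m; }
size_t quadratic_slot(size_t h, size_t i, size_t m) { return (h + c1*i + c2*i*i) % m; }
size_t double_slot(size_t h, size_t h2, size_t i, size_t m) { return (h + i*h2) % m; }

Each step is constant time once the hashes are cached, which is the point being made above.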
I'm sorry if my explanation was inadequate. I would recommend you read Knuth or a good book on it.WolfKeeper 16:31, 10 April 2006 (UTC)
I've looked through CLRS, and I don't really see a reference for the fact that an O(1) remove can only be achieved using a linear probe. If we can do a search in any open address hash table in constant time, then shouldn't we be able to delete elements and eliminate holes in constant time as well? --Josh 01:24, 11 April 2006 (UTC)
In summary, I think that there needs to be a lot of reworking to distinguish the terms O(..), expected O(..), and worst-case O(..). Further, I think that we might argue that remove(..) is expected O(1), but not worst-case O(1). I think that it is always worst-case O(n) with linear and quadratic probing. With double hashing, we would say that it is worst-case O(k n), assuming that k is not fixed size. Josh
With all due respect, I completely disagree. The whole point of hash tables is to design them so that you are able to apply statistical techniques like averaging. Using worst-case values for randomised parameters gives significantly inaccurate results. WolfKeeper 16:31, 10 April 2006 (UTC)
I'm not suggesting that we re-do all of the analyses in terms of worst-case performance. I understand that doing so would give an incredibly inaccurate picture of hash table performance. I simply think that we should qualify the asymptotic notation with "expected", to acknowledge that bad cases do exist. --Josh 01:24, 11 April 2006 (UTC)
I agree with Josh here. Unqualified big O notation specifically implies worst-case behavior in many publications. Better to either add the word "expected" where appropriate or make a big sweeping blanket statement at the beginning that it's implied. Deco 11:30, 8 June 2006 (UTC)

Something I don't understand from the article

If, when adding to a table, the hashing function leads to a collision for 2 particular keys, a and b, then using probing b will be stored somewhere after a. When then looking up the value associated with b, won't the table actually return the value associated with a? How does it know which of the two values to return, and how to get to the index associated with b?

I hope that makes sense. Iae 11:22, 8 June 2006 (UTC)

Providing we're not using perfect hashing, we must search through all the places the desired element could be, comparing our search key with each one, until we hit the end of the list. This is why hash tables require not only a hash function but a means of comparing for equality. This only works efficiently because in a hash table with enough room, these lists are overwhelmingly very short. Deco 11:27, 8 June 2006 (UTC)
Ah I see, thanks very much. Iae 11:49, 8 June 2006 (UTC)
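(To make the point above concrete, here is a minimal C sketch of a chained lookup; the node layout and names are hypothetical.)

#include <string.h>
#include <stddef.h>

struct node {                 /* one entry in a bucket's linked list */
    const char *key;
    int value;
    struct node *next;
};

/* The hash selects the bucket; equality comparison distinguishes keys
   (like a and b above) that collided into the same bucket. */
int lookup(struct node **buckets, size_t nbuckets,
           size_t (*hash)(const char *), const char *key, int *value_out)
{
    for (struct node *n = buckets[hash(key) % nbuckets]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0) { *value_out = n->value; return 1; }
    return 0;                 /* end of chain: key is not in the table */
}

With a decent hash function and load factor, the chains stay very short, which is why this is efficient in practice.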

How do you delete from a hash table that uses probing?

How do you know that you have hit the end of the list? Or, to cut to the gist of the matter: how do you delete from a hash table that uses probing?

You know you hit the end of the list in a (probed) hash table when you hit an empty "not occupied" slot. In theory, one could have an "occupied" bit for each row of the hash table that is initially 0. (In practice, typically each slot that is not occupied begins with a NULL byte. If it *is* occupied, that byte is the first byte of the key.)

I know of 2 very different ways to delete from a hash table: the "deleted bit" method and the "move stuff around" method. (My understanding is that the "move stuff around" method is impossible to implement with "quadratic probing" or "double hashing" hash tables (and probably a few other types). Those hash tables are forced to use the "deleted bit" method.)

Let me try to explain those 2 methods

"deleted bit" method: works with any kind of hash table. Every row of the hash table (in addition to the key, value pairs) has a "deleted" bit that starts out 0. To delete a record from the hash table, use the key to find it (using find_slot), then set the "deleted" bit to 1. (Some hash tables cram the "deleted" bit and the "occupied" bit into the "key" part of the record, reserving 1 special key to indicate "unoccupied", another special key to indicate "deleted", and any other value to indicate a real occupied "key" ).

"move stuff around" method: only works with linear probing.

The function remove(key) in the article is supposed to describe the "move stuff around" method. How could we make it easier to understand?

Often when the application deletes a record, the following slot is already not occupied. In that case you can wipe out the record by marking that record as not occupied -- overwriting the key with a NULL byte -- and you are done.

Unfortunately, there are a bunch of special cases the application needs to be able to handle, even though they rarely happen. As the article says, "For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record)." The application needs to scan through *all* the records between the record you want to delete, and the next "not occupied" slot, and make sure the above requirement is met. In some cases, you must move all those records up 1 slot. In other cases, you must move some (but not all) of those records.

In yet other cases, you must not move any of them, just mark the deleted slot "not occupied" just like the simple case. (For example, if you want to delete the record in slot 870, and you see that the "key" in slot 871 actually hashes to "871", and slot 872 is "not occupied" -- you must mark slot 870 "not occupied", and leave the record in slot 871 alone).
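(Here is one way to write the cluster-repair loop the paragraphs above describe, assuming linear probing with single-slot stepping; it uses the same hypothetical slot layout as the sketches above, with a NULL key meaning "not occupied".)

#include <stddef.h>

struct slot { const char *key; int value; };

/* Empty slot i, then walk the rest of the cluster and move records
   back so that no record has a vacant slot between its natural hash
   position and its current position. */
void remove_at(struct slot *t, size_t size, size_t i,
               size_t (*hash)(const char *))
{
    t[i].key = NULL;                       /* mark i "not occupied" */
    size_t j = i;
    for (;;) {
        j = (j + 1) % size;
        if (t[j].key == NULL)              /* end of cluster: done */
            return;
        size_t k = hash(t[j].key) % size;  /* record j's natural slot */
        /* Leave record j alone only if k lies cyclically in (i, j];
           otherwise the new hole at i would cut it off from its
           natural position, so it must move into i, leaving a new
           hole at j. */
        int reachable = (j > i) ? (k > i && k <= j)
                                : (k > i || k <= j);
        if (!reachable) {
            t[i] = t[j];
            t[j].key = NULL;
            i = j;
        }
    }
}

This handles all three cases above: the following slot already empty (the loop exits immediately), records that must move up, and records like the slot-871 example that must stay put.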

Once you understand how it works, please update the article to make it easier to understand for the next person.

--68.0.120.35 07:42, 5 March 2007 (UTC)

More concrete suggestions for hash function?

I wanted to check in here before making huge changes to the article, but one thing I'd find very helpful is a discussion of concrete choices for the hash function itself. Here's an outline of what I'd say:

It's very common to implement hash tables with poor hashing functions. Knuth is largely to blame, advocating the very weak "multiplicative hash" function, and even going so far as to claim that its clustering property is a good thing! (section 6.4, TAoCP vol III, 2e, p. 517). Variants of the multiplicative hash remain popular, as do other linear techniques such as CRC.

Surveying the field, two excellent newer alternatives stand out. For most simple hashing tasks, the Fowler-Noll-Vo (FNV) hash is an excellent performer. It is among the simplest of known hash functions, is quite fast, and has a good distribution. For use in a hash table, the FNV-1a variant is likely the best choice, as it has better (more dispersed) clustering behavior than FNV-1.

For some applications, particularly when keys are long, the newer Jenkins lookup3.c hash function may be a better performer. It achieves better speed by consuming 12 bytes of the input per iteration, as opposed to one byte for FNV. Disadvantages include greater code complexity, a sensitivity to machine endianness (causing potential difficulties when reusing hash values across disparate computers), and the need to pad byte-aligned input to 32-bit words. Do keep in mind, though, that benchmarks showing impressive speed for large blocks of data may not translate to real-world gains. Common usage patterns involve relatively short keys, so the amount of time spent in the hashing function's inner loop may be less relevant than, say, the gains from a compiler being able to automatically inline a simpler hash function.--Raph Levien 02:32, 5 July 2006 (UTC)
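(For reference, a sketch of the 32-bit FNV-1a variant mentioned above; the two constants are the published FNV offset basis and FNV prime.)

#include <stdint.h>
#include <stddef.h>

/* FNV-1a: XOR in each byte first, then multiply by the FNV prime.
   The XOR-before-multiply order is what distinguishes it from FNV-1
   and gives the better dispersion noted above. */
uint32_t fnv1a_32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t hash = 2166136261u;     /* 32-bit FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];
        hash *= 16777619u;           /* 32-bit FNV prime */
    }
    return hash;
}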

Some interesting links I came across while searching:

You might also want to check out HSH 11/13, which seems to distribute quite well (as per its published graph) and also performs nicely with its handful of code lines.

Followup:

I went ahead and completely rewrote the section. I may have come close to bending the letter of the law on doing original research (measuring avalanche behavior of the Jenkins One-at-a-time hash using Bret Mulvey's tools) and NPOV (I advocate particular hash functions and criticize other popular choices), but everything I said is verifiable, and I think the result is the most helpful and reliable advice on choosing hash functions anywhere on the Internet or in dead-tree form.

[Image: Avalanche behavior of HSH 11/13 hash over 3-byte keys]

I measured the HSH 11/13 hash using Mulvey's AvalancheTest, and found it slightly inferior to the Jenkins One-at-a-time hash. Herbert Glarner's tests for uniform distribution are not as sensitive as Mulvey's chi-squared tests. Further, HSH 11/13 is quite slow, because of its inner loop.

Obviously, I also changed my mind about FNV, swayed in large part by Mulvey's analysis. It's not a bad choice, being one of the simplest of the "pretty good" choices, but even there the XOR-folding adds a bit more complexity. In any case, I feel like I've provided enough information for people to make their own informed choice.--Raph Levien 21:31, 12 August 2006 (UTC)

I agree that this analysis of various hash functions are worth putting into Wikipedia.
But wouldn't the hash function article be a better place?
Or are there special considerations for hash functions used in a hash table that don't apply to hash functions used for other purposes?
--68.0.120.35 07:42, 5 March 2007 (UTC)

Benchmarks

"Simplicity and speed are readily measured objectively"

There is a caveat here. Experimental measurements of speed are necessarily done on a "representative sample" of inputs. It may be the case that a given algorithm performs with varying speed depending on the kind of input, and that a sample representative of one kind of input may not be representative of another kind. I don't think this would happen with usual hash functions on e.g. strings, but it may happen in more esoteric cases. David.Monniaux 14:38, 13 September 2006 (UTC)

Unfinished question

Problem 1: The hash table will be used for storage of student records. You may assume the maximum number of items will not exceed 240. An item corresponds to a student record. A key is taken to be an ASCII character string of maximum length 32 containing the name of the student. The data is taken to be the 9-digit id and the name of the state of residence of the student in India. A list node contains an item and a reference to the next node. You will need to implement the classes Item and ListNode with constructors and appropriate data access operations.

2. Write a main() program to test your class HashTable.

3. Input: The name of the input file, containing data items (student records) and data operations in the format given below, and the name of the output file, will be passed to your program on the command line or interactively.
Data items: student-id, student-name, state-of-residence, one per line. Example: 200412018, Swati Mishra, Uttar Pradesh
Data operations: <operation>: <item>, one per line. The <operation> is s, i, or d for search, insert and delete, respectively. The item will have fields student-name, student-id, and state-of-residence; the fields of an item will be separated by commas, and the operation will be separated from the item by a colon ":" and a space. Example: s: 200211001, Akash Gokhale, Maharashtra
The data items will be separated from the data operations by a blank line.

4. Output: Your program is to read the input file and populate the hash table with student records by repeated Insert() operations, and then print to the output file the size of the hash table and the size of each linked list in the hash table. Then it will continue to read each line and execute the appropriate data operations. Following each Insert()/Delete() operation, it will output the size of the hash table and the size of each of the linked lists, and for each Search operation, it will output

-- 220.225.53.35 09:18, 11 October 2006


joaat_hash function error

Hi, I've implemented the joaat_hash function which is described in pseudocode on this page in my C program, and encountered an error. Or better said, I produced one, since I misunderstood len as the length of the hash table, and not as the length of the key.

Here is my (corrected) function implemented in C; please consider updating the article:

#include <string.h>

unsigned int joaat_hash(char *key, size_t len) /* len is the size of the hashtable */
{
    unsigned int hash = 0;
    unsigned int i;

    for (i = 0; i < strlen(key); i++) {
        /* loop body as in the article's one-at-a-time pseudocode */
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    /* final mixing steps, also as in the article */
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return (hash % len);
}

--88.76.141.17 03:43, 2 January 2007 (UTC)

You are right -- the "len" in the pseudocode in the article is the length of the key. (That pseudocode gives the same results as the original C implementation "One-at-a-Time Hash" ub4 one_at_a_time(char *key, ub4 len) http://www.burtleburtle.net/bob/hash/doobs.html , right?)

I think the above implementation gives the same results as the original C implementation for all possible ASCII strings. Unfortunately, the above implementation gives different results for other sorts of data structures cast into byte arrays, when those data structures include a zero byte.

Since the original version gives *better* results for those data structures, and the *same* results for everything else, I think I'll stick with the original version, except use "key_len" to clarify exactly what it is the length of. (Or have I missed something obvious?) --68.0.120.35 07:42, 5 March 2007 (UTC)

Cryptographic hash functions

From the article: "In fact, even a cryptographic hash does not provide protection against an adversary who wishes to degrade hash table performance by choosing keys all hashing to the same bucket."

I thought that one of the criteria for a cryptographic hash function was that it be infeasible to find collisions. Therefore, it would provide defense against such an adversary. Or am I misunderstanding something? Ralphmerridew 21:55, 30 January 2007 (UTC)

Hash tables are usually very limited in size, and the output of the hash function is clipped modulo the hash table size. That is, while a hash might produce a result of 160 bits, it is obvious that a hash table of 2^160 entries would be infeasible, not to mention useless. In-memory hash tables rarely exceed millions of entries, and it is relatively trivial to brute force through this number of hashes to find collisions. For a sense of magnitude, a low-end processor today can compute around half a million SHA-1 hashes of 16-character strings per second. -- intgr 23:07, 30 January 2007 (UTC)
Doesn't that require that the attacker also knows the hash table size? (Alternately, if the attacker adds enough entries to be able to work out the size, it's also likely to force a rehash.) Ralphmerridew 00:51, 31 January 2007 (UTC)
Well yes, if secrecy is an option, it's a good idea to choose a large prime as the hash table size. This is, however, unrelated to whether one is using a cryptographic hash function or a classic one: it is impossible to cause collisions if you cannot predict the modulus or one of its factors.
But lots of hash table implementations resize automatically to predefined constant sizes, and as far as I know, many even use the worst choice of power-of-two sizes. Power-of-a-known-number sizes mean that the attacker can cause clustering even if they mispredict the size, and that the attack is effective even through reclustering. This is because hash % 2^n = hash % 2^m when n > m, and even if n < m, it will just cause clustering on 2^(m-n) different keys simultaneously, which can still be fatal with large hash tables. -- intgr 07:19, 31 January 2007 (UTC)
I have no idea what I was thinking earlier; the equation should be: for all n > m, if hash_1 % 2^n = hash_2 % 2^n, then hash_1 % 2^m = hash_2 % 2^m. -- intgr 17:41, 31 January 2007 (UTC)
Re point 1, with a bad hash function, an attacker can choose keys that hash to the same value, which will cause collisions regardless of modulus. Ralphmerridew 16:41, 31 January 2007 (UTC)
Oh, were you thinking of generating hashes small enough to always be less than the modulus? Good point, never thought that. (though I'm not the author of the quoted claim) -- intgr 17:41, 31 January 2007 (UTC)
No, I mean that, with a bad hash function, an attacker could, say, generate arbitrarily many strings that hash to, say, 0x16de fa32 4261 1ab3. Whatever modulus is used, all those strings will fall into the same bucket. With a cryptographically secure hash function, given a known modulus, an attacker might be able to produce a large number of strings such that (hash % modulus) are all the same, but he'd be unable to produce a significant number that all have the same hash value. Ralphmerridew 21:43, 31 January 2007 (UTC)
I wouldn't count on the speed (well, slowness) of a hash function for protection against clustering attacks. While it makes the attack slightly more expensive, it also complicates hash table inserts and lookups by the same proportional amount. If you can cope with the extra processing overhead, you're most likely better off with data structures that do not exhibit such critical worst-case performance, such as various balanced trees.
Perhaps this is an overly cryptographic point of view, but it is not that expensive to generate truncated hash collisions for cryptographic hash algorithms by brute force. And as mentioned above, a general-purpose low-end processor (Athlon 64 3000+ here) is capable of generating half a million hashes per second. A heavily parallelized FPGA-, or even ASIC-based chip could improve that by several orders of magnitude. (Such programmable FPGA computers are readily available on the market, e.g. COPACOBANA, which can do 10^13 DES trials per second.) -- intgr 22:37, 31 January 2007 (UTC)
I'm not depending on "Given string 'str', calculate hash(str)" being slow; I'm depending on "Given 'value', find a string 'str' such that hash(str) == value". The latter is part of the definition of cryptographically secure. And even 10^13 trials/second means brute force will take about three weeks per full collision with a 64-bit hash, and is ineffective against even the deprecated MD5 (128 bits) or SHA-1 (160 bits). By comparison, if I understand multiplicative hash correctly, it's possible to find a full collision in O(#bits) time. Ralphmerridew 23:12, 31 January 2007 (UTC)
Did you forget that to utilize all these 64 bits, you need to store the table somewhere? There are no practical applications of a 2^64-entry hash table, and the space requirements are proportional to the brute-force collision-finding time (both scale as O(2^n) in the number of hash bits n). Just storing this many hashes (or 8-byte values) alone will take up 2^64 * 8 bytes = 128 exabytes of space. If you create a smaller hash table, you have to truncate the hash (throw away some of its information), which inherently speeds up brute force cracking. -- intgr 23:29, 31 January 2007 (UTC)
But a collision against a truncated hash is only useful against a small number of moduli, and then only if the attacker knows or can work out the modulus. Ralphmerridew 23:57, 31 January 2007 (UTC)
This seems to effectively boil down to what I formulated earlier, "were you thinking of generating hashes small enough to always be less than the modulus?". I do agree that in case the attacker does not know the modulus, and the modulus is a non-tiny prime, then this effectively disables clustering attacks. I disagree that when the modulus is known, the hash table needs to be "small", but let's leave it at that. -- intgr 00:54, 1 February 2007 (UTC)
Well, I've changed the article now and put a {{fact}} on it, since we still need to cite a source for it. Or do you happen to have one? -- intgr 12:41, 5 February 2007 (UTC)
I agree that, for a certain special case, Mallory (the attacker) can guarantee that all the keys he generates hash to the same bucket, even when a cryptographic hash is in use.
That special case happens when an attacker doesn't know the exact table size, but does know that it is a power of 2, and at most some maximum size -- Mallory knows that slot_number = hash(key) % 2^n, and he knows the particular cryptographic hash() used, and although he doesn't know n exactly, he knows some k such that n <= k.
By doing some work reminiscent of hashcash to zero out the k least-significant bits, Mallory can generate keys that all hit the same bucket. Mallory generates O(2^k) trial keys before he finds one that he knows will hit the same bucket. (With most non-cryptographic hashes, Mallory only needs to do O(1) work to construct a key that will hit the same bucket).
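(A toy C sketch of this key search, using the one-at-a-time hash discussed elsewhere on this page; the "evil" key format and the k = 16 bound are arbitrary choices for illustration. With k = 16 it needs about 2^16 trials per key found.)

#include <stdio.h>

/* Jenkins one-at-a-time hash, as in the article's pseudocode */
static unsigned oaat(const char *key, size_t len)
{
    unsigned hash = 0;
    for (size_t i = 0; i < len; i++) {
        hash += (unsigned char)key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

int main(void)
{
    const unsigned k = 16;       /* attacker's bound: table size <= 2^k */
    char key[32];
    for (unsigned long c = 0, found = 0; found < 10; c++) {
        int n = snprintf(key, sizeof key, "evil%lu", c);
        /* keep only keys whose k low hash bits are zero: they land in
           bucket 0 of any power-of-two table of size 2^n with n <= k */
        if ((oaat(key, (size_t)n) & ((1u << k) - 1)) == 0) {
            printf("%s\n", key);
            found++;
        }
    }
    return 0;
}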
But so what? Is there any application where this relevant?
What sorts of applications need to accept keys from a potentially malicious attacker?
Say we did the "secure" thing by picking an L-bit cryptographic hash, and "randomly" picking a new large prime number p every time we resize the table, and using slot_number = (hash(key) % p) % tablesize. (Since tablesize << p << 2^L, does it matter whether the tablesize is a power of 2 or not?)
Even with this "secure" hash -- even if Mallory has no clue what hash we are using or how we are reducing it to the tablesize -- Mallory could still force collisions by sending O(tablesize) submissions.
(By the birthday paradox, he is *likely* to cause a collision even with O(tablesize^(1/2)) submissions).
Is there some application where this "secure" hash would work, but the above power-of-2 "special case" wouldn't work?
--68.0.120.35 07:42, 5 March 2007 (UTC)

Open hashing

I'm considering merging the open hashing article with this article, as has already been done for closed hashing. For that I think we should rename the section on Chaining to Open hashing, make clearer the already-mentioned fact that linked lists are only one way, though perhaps the most popular one, of organizing hash buckets, and introduce a mention of buckets allocated in contiguous memory/disk space, which is presently discussed in open hashing. Jyotirmoyb 05:16, 20 February 2007 (UTC)

I agree -- let's keep all the different types together in this one article until the article gets too big. At that time, it will (hopefully) be easier to look at the article and decide the "natural" breaking points to split it up into multiple articles. Big Buckets First. --68.0.120.35 07:42, 5 March 2007 (UTC)

Makes sense - can't have it both ways - with closed hashing merged (though it's hard to find any content called "closed hashing" in the article) and open hashing in its own article. On the other hand, there is no problem with an article on "open" and "closed" and other types of hashing if there is enough content to justify it. Does Wikipedia have guidelines on article length? --121.45.246.110 14:10, 9 April 2007 (UTC)
