Talk:Huffman coding
From Wikipedia, the free encyclopedia
~ I believe I may have better solved the example "Sample-1". I may be wrong, but I found the algorithm poorly explained in my notes, so I visited Wikipedia to understand it better... I tried to solve it in the way that seemed most logical to me: building a complete binary tree of appropriate depth and using the percentage costs as guides for choosing the boundaries.
What I got was this: there is 1 character with weight 30; 1 with weight 20; 3 with weight 15; 1 with weight 10 and 3 with weight 5, for a total sum of 120. Dividing each weight by 120 and using the percentage to decide how much of the tree to take, you get 30/120 = 25%... and you go on like that. Using this technique I got:
Character | Weight | Coding |
---|---|---|
h | 30 | 00 |
e | 20 | 010 |
b | 15 | 011 |
d | 15 | 100 |
g | 15 | 101 |
a | 10 | 1100 |
c | 5 | 1101 |
f | 5 | 1110 |
i | 5 | 1111 |
Total cost is exactly the same, but the coding varies in size between 2 and 4 bits, and in #Basic_technique it says that "It is generally beneficial to minimize the variance of codeword length." I believe that makes my solution better, so if anyone with more than a mere day's knowledge on the matter agrees, then by all means make the correction :) --nunocordeiro 02:02, 11 July 2006 (UTC)
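The "total cost is exactly the same" claim above can be checked mechanically. The following is a minimal sketch (not part of the original discussion; the table of lengths is transcribed from the code table above) that computes the weighted total for the proposed code and compares it against a standard bottom-up Huffman construction on the same weights:

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return optimal codeword lengths per symbol via Huffman's bottom-up merging."""
    tiebreak = count()  # unique counter so the heap never compares the dicts
    # heap entries: (subtree weight, tiebreak, {symbol: depth within subtree})
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # merging two subtrees pushes every leaf in them one level deeper
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

weights = {'h': 30, 'e': 20, 'b': 15, 'd': 15, 'g': 15,
           'a': 10, 'c': 5, 'f': 5, 'i': 5}

# Codeword lengths from the table above: h=2, e/b/d/g=3, a/c/f/i=4
table_lengths = {'h': 2, 'e': 3, 'b': 3, 'd': 3, 'g': 3,
                 'a': 4, 'c': 4, 'f': 4, 'i': 4}
table_cost = sum(weights[s] * l for s, l in table_lengths.items())

lengths = huffman_code_lengths(weights)
huffman_cost = sum(weights[s] * l for s, l in lengths.items())
print(table_cost, huffman_cost)  # → 355 355
```

Both totals come out to 355, consistent with the claim that the proposed code matches the optimal cost (individual codeword lengths may differ between tie-breaking choices, but the weighted total cannot).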
- You got the right answer for the wrong reasons. You used Shannon-Fano coding, which in this case happens to yield a result identical (in terms of each symbol's weight) to the bottom-merge Huffman code, the Huffman code with the lowest variance and lowest maximum length (among optimal codes). In general, these are not the same, and Shannon-Fano is suboptimal. The example should be replaced by one that either yields only one Huffman code (again, in terms of symbol weights, so {0,1} is the same code as {1,0}) or explains bottom-merge Huffman coding. I'll put this on my to-do list, but if someone wants to fix this, go ahead. Calbaer 02:10, 27 July 2006 (UTC)
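To illustrate the point that Shannon-Fano and Huffman differ in general, here is a hedged sketch (the five-symbol weight set {15, 7, 6, 6, 5} is a standard textbook illustration, not taken from this discussion) comparing the total weighted codeword length produced by Shannon-Fano's top-down splitting against Huffman's bottom-up merging:

```python
import heapq
from itertools import count

def huffman_cost(weights):
    """Total weighted codeword length of an optimal (Huffman) code.
    Each merge adds one bit to every leaf below it, i.e. the merged weight."""
    tb = count()  # tie-breaker so the heap only compares weights
    heap = [(w, next(tb)) for w in weights]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        w1, _ = heapq.heappop(heap)
        w2, _ = heapq.heappop(heap)
        total += w1 + w2
        heapq.heappush(heap, (w1 + w2, next(tb)))
    return total

def shannon_fano_cost(weights):
    """Total weighted codeword length of a Shannon-Fano code (top-down split)."""
    def rec(ws):
        if len(ws) <= 1:
            return 0
        # split where the two halves' totals are as close as possible
        best_i = min(range(1, len(ws)),
                     key=lambda i: abs(sum(ws[:i]) - sum(ws[i:])))
        # every symbol in this group gains one bit, i.e. sum(ws)
        return sum(ws) + rec(ws[:best_i]) + rec(ws[best_i:])
    return rec(sorted(weights, reverse=True))

w = [15, 7, 6, 6, 5]
print(huffman_cost(w), shannon_fano_cost(w))  # → 87 89
```

Here Shannon-Fano costs 89 against Huffman's optimal 87, so the two constructions coincide only by accident on the example in the thread above.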
Removed "*Animation of the Huffman Algorithm: Algorithm Simulation by Simon Mescal" since the link is dead. If it comes back up, move it back in.
What is meant by "cost" in this article?
- Yeah, I was wondering too. Look what I found [1] -- now I see. The cost is the cost to send each letter in the target alphabet. For example, if we send letters as in Morse with "." & "_", we will spend more on "_". Gnomz007 15:07, 16 Nov 2004 (UTC)
Or to qualify a little more: the letter S in Morse is three dots in rapid succession -- three time periods. The letter O is three dashes, and each dash takes up at least two time periods, so at least six time periods in all. Therefore, sending an O is at least twice as expensive (in time!) as sending an S. You would think that a byte is always 8 bits long and so this has no meaning. However, most US English text only uses about 7 bits per character. If you find that no 8-bit characters are being used in the file -- or if you find a way to mark them as such -- you can compress the file down to 7/8 of its original size right there. Consider some of the Unicode encodings, where most characters take 8 bits, but if you want an extended character, you can use extra bytes if you flag them. --67.177.235.1 18:55, 4 October 2005 (UTC)
- Oh no! This is wrong, wrong, wrong (in the context of the article). For most applications, costs are probabilities, pure and simple. But they need not be, because they are just multiplicative factors (or, as the Basic technique section has it, weights) for the expression to be minimized. Adding to the confusion, the Huffman coding with unequal letter costs subsection uses cost in the way you two and the linked paper are discussing, as output costs! It seems as though an information theorist wrote most of the article but a computer scientist wrote the Problem definition section. Anyway, I think most people who read the entire article would come away very confused. Looks like it's time for a rewrite (or at least a find/replace of the word 'cost').... Calbaer 02:22, 27 July 2006 (UTC)
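To make concrete the point that weights are just multiplicative factors in the objective, here is a small sketch (function and variable names are mine, not from the article) showing that rescaling the example's weights into probabilities leaves the optimal code unchanged and merely rescales the minimized quantity:

```python
import heapq
from itertools import count

def huffman_total(weights):
    """Sum of w_i * l_i for an optimal prefix code, via Huffman merging:
    each merge adds one bit to every leaf beneath it, i.e. the merged weight."""
    tb = count()  # tie-breaker so the heap never compares beyond the weight
    heap = [(w, next(tb)) for w in weights]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        w1, _ = heapq.heappop(heap)
        w2, _ = heapq.heappop(heap)
        total += w1 + w2
        heapq.heappush(heap, (w1 + w2, next(tb)))
    return total

w = [30, 20, 15, 15, 15, 10, 5, 5, 5]  # weights from the example above
p = [x / 120 for x in w]               # same weights, normalized to probabilities
# Same merges either way; the objective just rescales by 1/120.
print(huffman_total(w))  # → 355.0 (total cost in weight units)
print(huffman_total(p))  # expected codeword length in bits/symbol, = 355/120
```

Whether one calls the multipliers weights, probabilities, or (input) costs, the minimized expression is the same up to a constant factor, which is distinct from the per-output-letter costs discussed in the unequal letter costs subsection.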
Back-story on the Discovery of Huffman Coding
Dr. David A. Huffman was the Chairperson of the Information and Computer Science Department at the University of California at Santa Cruz (UCSC) in the early 1970s. He taught an undergraduate course, "The Algebraic Foundations of Cybernetics", in 1971-73. During one lecture he taught "Minimum Redundancy Coding". A student noticed that, in some textbooks, it was called "Huffman Coding" -- and asked him if that was the same principle.
Dr. Huffman replied by relating an interesting story.
As a student at MIT, David Huffman had been working on an unsolved, challenging problem in Information Theory. Instead of studying for the final, he had decided to devote all his free time to work on this problem. The night before the final, with no solution yet, in a moment of despair, he threw his large stack of notes into the trash. He described the next moment as a flash of insight in which the solution occurred to him.
He went on to explain that he was still working on many difficult, interesting problems that, when solved, could be represented in textbooks for centuries. He did not want such a "simple idea" as minimum redundancy coding to bear his name. Students wondered and discussed among themselves: was this humility, or hubris?