I see
talismancer has just managed to sneak in a post for yesterday. I can't really mock him for that one, seeing as I did the same several times during last year's run (and in fact while last night's post has a timestamp of 11:59:59 it was actually submitted a few seconds later).
Anyway, on to today's post. And today's random topic shall be on why hashing algorithms are not unique identifiers.
Let's take the SHA-1 algorithm. It's a cryptographically secure hashing algorithm that produces a 160-bit output. Now, 2^160 is a stupidly large number: over 1.46×1048. To put that in perspective, if you multiplied that number by 100 the result is more than the number of atoms in the earth (estimated at 1.33×1050). That should be large enough to give a unique output for any input, right?
Wrong.
Let's say that you somehow generate every possible sequence of 20 bytes (all 2^160 of them), and feed each one in turn to SHA-1. You'd get a different 20-byte sequence output for each input - in fact, you'd get 2^160 different outputs (this does assume that SHA-1 can generate a different output for each 20-byte input - it may well not do this, in which case you've already got your collision). Now generate a 21-byte sequence, and hash that. The result will match one of your previous outputs, since there are only 2^160 possible outputs and you've already generated every single one.
Of course, doing such is completely impractical. That doesn't mean that hash collisions won't happen - they're still possible, since for every possible output there are an infinite number of inputs that will give that output. I'll grant that actually finding a collision is extremely improbable, but "extremely improbable" is not the same as "impossible" and does not mean "cannot happen tomorrow". The more inputs you try the higher the chance of a collision is as well, due to the birthday paradox - if one were able to generate 2^80 inputs (still completely impractical, but less impractical than 2^160 inputs), then the chance of a collision is 50%.
And that's why things like Mercurial and git make me slightly nervous, since that uses SHA-1 as a unique identifier and just assumes that a hash collision won't ever happen.
Anyway, on to today's post. And today's random topic shall be on why hashing algorithms are not unique identifiers.
Let's take the SHA-1 algorithm. It's a cryptographically secure hashing algorithm that produces a 160-bit output. Now, 2^160 is a stupidly large number: over 1.46×1048. To put that in perspective, if you multiplied that number by 100 the result is more than the number of atoms in the earth (estimated at 1.33×1050). That should be large enough to give a unique output for any input, right?
Wrong.
Let's say that you somehow generate every possible sequence of 20 bytes (all 2^160 of them), and feed each one in turn to SHA-1. You'd get a different 20-byte sequence output for each input - in fact, you'd get 2^160 different outputs (this does assume that SHA-1 can generate a different output for each 20-byte input - it may well not do this, in which case you've already got your collision). Now generate a 21-byte sequence, and hash that. The result will match one of your previous outputs, since there are only 2^160 possible outputs and you've already generated every single one.
Of course, doing such is completely impractical. That doesn't mean that hash collisions won't happen - they're still possible, since for every possible output there are an infinite number of inputs that will give that output. I'll grant that actually finding a collision is extremely improbable, but "extremely improbable" is not the same as "impossible" and does not mean "cannot happen tomorrow". The more inputs you try the higher the chance of a collision is as well, due to the birthday paradox - if one were able to generate 2^80 inputs (still completely impractical, but less impractical than 2^160 inputs), then the chance of a collision is 50%.
And that's why things like Mercurial and git make me slightly nervous, since that uses SHA-1 as a unique identifier and just assumes that a hash collision won't ever happen.
no subject
Date: 2011-11-03 07:23 pm (UTC)The other issue I have with the whole head-in-the-sand approach is that it's not hard to work around this. Off the top of my head you could append a sequence number to your hash, you could have a collision chain and walk the list looking for your exact data, or you could add a dummy field to the data and modify that until you don't get a collision. Admitidly they all have limitations, but they're very deterministic limitations (like a fixed upper limit on the number of supported collisions). I think git even has a compile-time option for one of these.
Or as an alternative, one could use an identifier that's designed to be unique. A GUID is a good example (though that does assume everyone is playing by the rules and not doing naughty things like reusing MAC addresses), or you could derive it from your public IP address at which point all you need to append is a locally unique identifier.
Random aside: we have a piece of software that will handle upwards of 4000 messages per second. A quick back-of-the-envelope calculation makes that about 1.5×1011 messages per year. Run it on a million servers, and there's your 1017 :)