Can someone who knows crypto shed some light on the "gravity" of this algorithm? Is this going to out and out replace MD5, or will it fall among the many "awesome, but not widely used" hashes out there? It sounds pretty awesome... almost too awesome!
MD5 has already been out-and-out replaced. For applications that require collision resistance, MD5 is already dangerous. Even in a setting where MD5 is unlikely to cause real problems, like in an HMAC authenticator, its use would still be flagged in an audit.
No competent new design in 2014 would use SHA1 either.
In the world we live in now, there are two "mainstream" options for hash functions: SHA2 (ie, SHA2/256, SHA2/512, &c) and SHA3. SHA3 is actually not at all mainstream (very few applications use it), but (a) that will change over time and (b) nobody will really bat an eyelid if you choose to use it.
What Zooko is advocating for is a third mainstream option: BLAKE2. BLAKE2 is derived from a well-regarded SHA3 finalist; the preponderance of evidence suggests that BLAKE2 is safe, and arguably a more conservative choice than SHA3 (a/k/a Keccak, pronounced "Ketchup"), because SHA3's novel sponge function design hasn't been banged on the way SHA2's design, which BLAKE sort of shares, has.
The reality is that for most software designers, this isn't an important issue; most software isn't constrained by the speed of its hashing algorithms (the common bottleneck in crypto code is the bignum work public key algorithms have to do --- if you're worried about being bottlenecked on hashing, you probably have a much bigger problem with I/O).
I was wondering how the new "optional parallel mode" affects the long history of professional scrutiny. Is adding an optional parallel mode something that a professional cryptographer can do without significantly changing the security model?[^1] Are there certain families of hash designs amenable to slight tweaks that sacrifice a little security for a relatively large speed improvement, or is this something that could happen with the majority of hash schemes?
As always, thanks for playing "HN asks tptacek about security."
[^1]: This is a genuine question. I do not have any reason to doubt zooko et al's judgement. I guess another way of saying this is "For someone smarter than me is it easy to know when this is a good/bad idea?"
Strange segue: I was browsing PLOS and saw this article about egg washing and shell characteristics. After reading the title, for some reason I remembered a somewhat lengthy comment you wrote about the safety of eggs and the different egg washing regimes in the US and the UK. I didn't check to see if my recollection was accurate, but in case it is, and the comment was indicative of your interest in the subject, you might get a kick out of:
Effect of Egg Washing and Correlation between Eggshell Characteristics and Egg Penetration by Various Salmonella Typhimurium Strains. PLoS ONE 9(3): e90987. doi:10.1371/journal.pone.0090987
If you do the padding and signaling of last nodes correctly, which as far as we know BLAKE2 does, a break in the parallel mode translates to a break in the sequential hash function. This was proven by the Keccak people a while ago [1].
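The intuition behind that result can be shown with a toy sketch (this is not BLAKE2's actual tree mode, just an illustration of domain-separating leaf and root nodes with tag bytes, so a parallel-mode digest can never be confused with a sequential one):

```python
import hashlib

def toy_tree_hash(data: bytes, leaf_size: int = 4) -> str:
    # Toy illustration: each leaf is domain-separated with a 0x00 tag,
    # and the final (root) node with a 0x01 tag. Because of the tags,
    # a leaf digest can never collide "by construction" with a root
    # digest, which is the kind of signaling the proof relies on.
    leaves = [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)] or [b""]
    leaf_digests = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    return hashlib.sha256(b"\x01" + b"".join(leaf_digests)).hexdigest()
```

Note that without the tag bytes, an attacker could submit a string of leaf digests as a "message" and get the same output as the tree hash of the original data; the tagging is what makes a break of the tree mode imply a break of the underlying sequential function.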
With serious protocol designers now largely side-stepping the IETF due to BULLRUN infiltration, BLAKE2 and other non-standard primitives have a better shot than before of seeing mass adoption. I hope to see BLAKE2 widely used in the future.
I also hope to see more non-standard crypto and protocols, where "the market" leads the way, and standards groups try to keep up in order to appear legitimate.
> I also hope to see more non-standard crypto and protocols, where "the market" leads the way
This is super-dangerous, unless the amorphous "market" is also paying for cryptanalysts to bang on the crypto primitives as a public service to all competitors in the market.
After all, RSA adopting Dual EC DRBG was a business decision, and one which the market didn't reverse despite Dual EC DRBG being publically known to have a probable backdoor since 2007.
Maybe we're using different definitions of standardization.
The IETF writes RFCs which developers are expected to follow, and (mostly) do. This is a standardization of sorts, but it's beside the point I was making.
I'm not talking about Joan Daemen wrt BULLRUN. I'm referring to secure protocols that offer RC4 but not Salsa20, TLS without a single constant time cipher, 112-bit security, secure protocols that aren't even encrypted, null ciphers, Dragonfly, cipher suites so complex that they're expected to be implemented wrongly, secure protocols made so complex they won't be used at all, crypto advisory groups run by NSA employees, etc.
TLS offers RC4 and not Salsa20 because RC4 existed when SSL was first defined by Netscape, and Salsa20 didn't, and wouldn't for over a decade.
It's worth mentioning here that the "encrypted by default" Internet that was the dream of the 1990s was a government project, and TLS more or less thwarted it.
Different ciphers can be more or less straightforward to implement without timing leaks, but "constant time" is a property of an implementation, not of a cipher.
Dragonfly is exceedingly lame, but it's also completely inside-baseball. Even if Dragonfly had been "standardized" by the TLS WG, nobody ever would have used it, because nobody ever used SRP either, and SRP was better.
The ciphersuites in TLS aren't complicated. They're very simple. The problem with TLS ciphersuites isn't that they're complicated but that they're wrong. Which is unsurprising, because they were designed before anything like Bellare and Namprempre; in fact, they were designed in an era where many practitioners believed that message authentication was unnecessary for cryptosystems at all.
I'm not sure which complex secure protocols you're referring to. TLS and SSH are so widely used it seems fair to call them universal.
As for the CFRG chair, well, I won't repeat myself:
In the end, though, the real issue I have with your comment is that the IETF has nothing at all to do with BLAKE2's standards-friendliness. The IETF will soon standardize ChaCha20-Poly1305 for TLS, for instance, despite the fact that no NIST standard will ever do the same.
These reasons don't matter. What matters is that you use a widely used hash function through a widely used software library. Most security issues are implementation-related, not cryptography-related. And if there are problems with a scheme, you want as many people as possible looking for them. SHA2/3 are better options in this light.
SHA2-style design, and faster than MD5 - do you really not see the value in this? Sure, you wouldn't want to use this for your own coin (because of, say, ASIC and GPU miners), and a saltless password hash built on it would have brute-forcing issues.
But databases, filesystems, checksums for packages, git (to be fair, they're all storage systems of some sort)?
By the reasoning you gave above, we would still be using MD5 in everything.
Actually, even in coins fast hashes have their place (specifically, in Merkle trees; we're using SHA3 in Ethereum precisely because computers will have hardware modules for it), and for mining you don't want to be using anything that's classified as a hash function anyway.
Glad I came across this -- I have a need for a hash function with cryptographic properties but that I don't actually mean to use for crypto, but just to uniquely identify a file by its contents. Speed is really important to me. This sounds perfect for my application.
This author almost completely misses the point. The basic architecture behind SHA-2 is similar to MD5 and SHA-1, and the fact that attacks are possible on MD5 and SHA-1 means an attack could be possible on SHA-2. If SHA-3 had used the same architecture again, a weakness in SHA-2 might have led to a weakness in SHA-3, leaving both SHA-2 and SHA-3 insecure.
Since SHA-3 is based on a totally different architecture, any weakness in SHA-2 has no effect on SHA-3. At the same time, if someone does find an attack against SHA-3, SHA-2 is still secure.
Because SHA3 is the standard, we're going to see hardware implementations of it, and a lot of research will go into its software implementation.
If you care about performance, stick to the standard.
Another commenter also rightly pointed out that SHA3 implementations will be more scrutinized, and therefore more secure. Exploitable security issues rarely lie in the algorithm itself.
Even there the standardization process worked: The flaw in Dual EC DRBG was pointed out very quickly, precisely because there was so much cryptanalytic attention paid to it. The fact that RSA (and only RSA, AFAICT) didn't pay attention to the crypto break in exchange for $10MM doesn't change the fact that the vulnerability had been quickly discovered.
If the algorithm had been left unstandardized and was simply "Foo Corp's Custom Wonderful Bit Generator™" the public may never have known of the vulnerability, while the NSA would still have the resources to have discovered the flaw on their own and use TAO to recover the priv key.
Unfortunately, last I heard NIST was trying to weaken the default SHA3 implementation, even compared to SHA2 (give it only 64 bits of security), and not implement Keccak "as is". Keccak as is should be fine, but I wouldn't touch NIST's modified Keccak "for performance" as SHA3.
Also, wouldn't we rather trust software implementations than hardware implementations post-Snowden revelations? Isn't that the same logic behind PRNGs now?
They did not plan to make it 128 bits. The standard output sizes have always been 224/256/384/512 as specified in the competition.
For an ideal secure hash function with a 256-bit output it takes ~2^128 operations to find a collision, while most other attacks, like finding preimages, take ~2^256.
The sponge construction Keccak uses allows you to reduce the difficulty of finding preimages in return for increased speed, by adjusting the capacity of the sponge.
The idea was to have a fixed "level of security" for each hash, based on the collision resistance, and tune the other parameters based on that. So a 256-bit output would require 2^128 operation for either a collision or a preimage attack.
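Under generic attack bounds, that trade-off can be written down directly. A minimal sketch of the arithmetic, assuming the usual flat-sponge security claim (collision resistance ≈ min(n/2, c/2) bits and preimage resistance ≈ min(n, c/2) bits for an n-bit output and capacity c):

```python
def sponge_security_bits(output_bits: int, capacity_bits: int):
    """Generic security levels for a sponge-based hash (flat sponge claim)."""
    collision = min(output_bits // 2, capacity_bits // 2)
    preimage = min(output_bits, capacity_bits // 2)
    return collision, preimage

# Keccak as submitted: 256-bit output with capacity 512
print(sponge_security_bits(256, 512))  # => (128, 256)

# The proposed tuning: capacity 256, trading preimage resistance for speed
print(sponge_security_bits(256, 256))  # => (128, 128)
```

Lowering the capacity means more of the state is used to absorb input per permutation call, hence the speedup, at the cost of the preimage bound dropping to match the collision bound.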
Listen to tveita and go read the Keccak papers, and you will know more! Fighting against the NSA is great, but don't judge a new cryptographic hash on rumors.
SHA-1 intrinsics are going to be out in 2015-2016. You're going to be waiting a very long time before SHA-3 intrinsics hit. If you care about software performance, you will not be using SHA-3.
Accelerated SHA1 has been available for a long time; it was necessary for embedded systems.
The NIST is currently modifying SHA3 to make it faster, with some controversy.
You are right in the sense that we will have to wait for the definitive standard to come out before we see some heavily optimized implementations.
If you want immediate performance, use SHA2. If you want more "future proof" security, use SHA3 (and one could say that SHA2 is still very good in terms of security).
Stick to the standard, unless you know a lot about crypto.
Excuse my lack of knowledge on the subject. I know MD5 has its cryptographic security issues [1], but I almost always assumed that the main reason it was a 'weak' hash was its speed and how, consequently, rainbow tables were readily available. If this premise is correct, wouldn't that inherently make BLAKE2 less strong/secure since it's faster?
> I know MD5 has its cryptographic security issues [1] but I almost always assumed that the main reason it was a 'weak' hash was its speed
No, for hashes, you definitely want speed; that's not the problem with MD5. For applications where speed is not appropriate, such as password hashing, you want an iterated hash of some kind. (Disclaimer: as with most crypto, you want one of the existing implementations, not something custom.)
The cryptographic security issues with MD5 concern breaks to the hash algorithm that make it possible to find hash collisions more easily than the number of bits in the hash would imply. A 128-bit hash should require 2^128 hashing operations to find a collision against an existing fixed value. (Finding two independently chosen values with the same hash is far easier, due to the birthday problem: https://en.wikipedia.org/wiki/Birthday_problem ) MD5 has algorithmic weaknesses that make it far easier to find collisions, to the point where a current system can find some types of collisions in seconds.
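The birthday-problem gap can be sketched numerically (this is the standard approximation for any ideal b-bit hash, not anything specific to MD5): with k random inputs, the chance of at least one colliding pair is roughly 1 - exp(-k^2 / 2^(b+1)), so ~2^(b/2) inputs already give decent odds:

```python
import math

def birthday_collision_probability(num_hashes: float, bits: int) -> float:
    # P(at least one colliding pair among k inputs to a b-bit hash)
    # ~= 1 - exp(-k^2 / 2^(b+1))   (standard birthday approximation)
    n = 2.0 ** bits
    return 1.0 - math.exp(-(num_hashes ** 2) / (2.0 * n))

# For a 128-bit hash, 2^64 random inputs give ~39% odds of some collision...
print(birthday_collision_probability(2.0 ** 64, 128))  # ~0.3935
# ...while 2^32 inputs give essentially zero.
print(birthday_collision_probability(2.0 ** 32, 128))
```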
It sounds like you are referring to password hashing. Neither BLAKE nor MD5 should be used for password hashing. At the bottom of the article make sure to read his section "P.S. this isn't about hashing passwords"
For MD5 and SHA-1, given any hash, it's possible to reconstruct the internal state of the hash function. (They're named "hash function," but "hash state machine" is probably a better way to think about it.) For example,
MD5("The quick brown fox jumps over the lazy dog").hexdigest() =>
9e107d9d372bb6826bd81d3542a419d6
Given 9e107d9d372bb6826bd81d3542a419d6, you can reconstruct the internal state of MD5, allowing you to continue hashing data. (In a real length-extension attack the appended data also has to account for MD5's padding of the original message; that detail is glossed over here.) For example, you could add " while the cow goes moo":
MD5("The quick brown fox jumps over the lazy dog while the cow goes moo").hexdigest() =>
ab7e4a96438e33094a7df3d41fa89c47
MD5_FromHash("9e107d9d372bb6826bd81d3542a419d6").append(" while the cow goes moo").hexdigest() => ab7e4a96438e33094a7df3d41fa89c47
Ideally, you'd like to be able to reveal the hash publicly without revealing any details about the internal state of the hash algorithm. (It's my understanding that SHA-3 accomplishes this goal automatically.) Another way to accomplish that goal is to use HMAC-MD5 or HMAC-SHA1. At that point you can prove that you hashed something by prepending a secret key to the data. For example,
HMAC_MD5("My 256-bit random secret key", "My message to be hashed").hexdigest()
This would yield a hash value that you can reveal publicly, and which no attacker can append any data to. For any given data D, you can verify that you generated the hash value by computing
HMAC_MD5("My 256-bit random secret key", D).hexdigest()
and comparing it to the other hexdigest. If they're equal, then either some attacker has stolen your private key or you generated the hash.
But if you tried this with plain old MD5, then an attacker can easily append data and trick you into thinking you generated it.
This is actually all the time I have right now, so someone else please feel free to step in and fill in missing details, such as how to find two different messages which hash to the same MD5 value.
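A runnable sketch of the HMAC pattern described above, using Python's standard hmac module (with SHA-256 rather than MD5, since the HMAC-MD5 above is just for exposition; the key and message are placeholders):

```python
import hashlib
import hmac

key = b"a 256-bit random secret key....."   # in practice: os.urandom(32)
message = b"My message to be hashed"

# This tag can be published freely: without the key, nobody can forge
# it or extend the message behind it.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

def did_i_hash_this(key: bytes, data: bytes, published_tag: str) -> bool:
    candidate = hmac.new(key, data, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing leaks
    return hmac.compare_digest(candidate, published_tag)
```

Here did_i_hash_this(key, message, tag) returns True, while any appended data changes the tag, so did_i_hash_this(key, message + b" moo", tag) returns False - which is exactly the property plain MD5 lacks.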
Sorry, I wasn't very clear. I was saying that in addition to failing (b), (c), and (d), there's also an additional problem called the length extension attack.
For MD5, a core problem is that the total number of computable hashes is small, meaning that 1) rainbow tables are possible because the entire hash space can be stored in memory on commodity hardware, and 2) specific collisions can be trivially engineered.
This means that both data authenticated by MD5 hashes (say, software package signatures) and access mediated by hashes (say: passwords, if you're incredibly stupid), can be trivially hacked.
I've taken the opportunity to look up how rainbow tables are used in practice: what are constructed are rainbow tables of likely hits, combined with exploitation of cryptographic weaknesses in the MD5 algorithm. So, no, it's not the total address space, but, say, keywords in multiple languages or known revealed passwords (of which the most common give you tremendous amounts of access -- the top 10 and 100 passwords will open many systems, and lists of millions are now available, for which rainbow tables are easily constructed). And the collision problem also remains.
As for assembling large amounts of storage: with distributed systems, whether on bare metal, cloud systems, or botnets, it's possible to aggregate terabytes of memory (not merely on-disk storage) for quite modest budgets within reach of a company or moderately-high-net-worth evil genius, let alone a state actor. For commodity x86 servers, high-end memory now looks to be ~320-512 GB, though the terabyte range wouldn't surprise me (I think SGI are still pushing the envelope in this area in their Rackable incarnation).
Cryptographic hashes and checksums serve different purposes. CRC32c has a hardware implementation on Intel that can do between 3-4 gigabytes a second per core (depends on if you pipeline instructions) with software implementations doing 800 megabytes a second per core.
Fast cryptographic hashes are more like 100 megabytes a second per core or slower. You can use cryptographic hashes in performance sensitive scenarios like git to get a unique fixed size handle on some bytes, but CRC32c is faster if all you want to detect is random changes.
I wouldn't expect a filesystem to need collision resistant cryptographic hashes.
There is hardware support for SHA-1 and SHA-2 now, but it's recent and I haven't seen an implementation or benchmark. I doubt it's as fast as the hardware CRC32c implementation.
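The distinction between the two kinds of check can be shown concretely. (Note: Python's zlib.crc32 uses the IEEE polynomial; CRC32c proper uses the Castagnoli polynomial, which is what Intel's SSE4.2 instruction accelerates and what third-party packages like crc32c expose.)

```python
import hashlib
import zlib

data = b"example block of file data" * 64

# CRC32: 32-bit checksum, excellent at catching random bit flips,
# trivially forgeable by anyone who can choose the data.
checksum = zlib.crc32(data) & 0xFFFFFFFF

# SHA-256: 256-bit cryptographic digest, collision-resistant,
# but slower in software.
digest = hashlib.sha256(data).hexdigest()

print(f"crc32:  {checksum:08x}")
print(f"sha256: {digest}")
```

A filesystem guarding against random corruption only needs the former; deduplication or signature-style uses need the latter.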
Your estimation of BLAKE2's data/second is indeed off by an order of magnitude <https://blake2.net/>, and the demand is certainly there to have better checksum algorithms in filesystems, both for integrity and deduplication. ZFS for example uses SHA-2.
BLAKE2's authors do say they designed a fast algorithm envisioned for storage, so that's why I think it fits btrfs.
Thanks, that's very informative. I checked out the link and it is very cool how BLAKE2 can be tuned for different roles. I hadn't thought of how newer filesystems are doing deduplication; my head is stuck in the ext4 era.
I still think performance matters. Even at 800 megabytes a second you are talking about committing two entire cores to checksums on 10-gig E if you need to move data around. An entire core if you are talking about sequentially scanning an SSD. I suppose this will stop mattering as we get more cores.
If a filesystem is using CRC32 for something, it doesn't need the properties of a cryptographic hash or they are doing it wrong. I can see how you can argue against CRC for reliability.
I am not sure you actually risk a corrupt block every 1 in 2^32 blocks. Most blocks won't be corrupt, so the CRC only has to detect a much smaller number of errors. Assuming every block had an error needing detection, you would miss an error every 16 terabytes (making some other assumptions as well). Assuming 1% of blocks are corrupt, you would miss an error every 1.6 petabytes? Maybe I am thinking about this wrong, and I recall other factors like block size affecting CRC's reliability.
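That arithmetic checks out if you assume 4 KiB blocks (an assumption; the comment doesn't state a block size):

```python
# With 4 KiB blocks and a 1-in-2^32 chance that a corrupt block slips
# past a 32-bit CRC: if *every* block were corrupt, you'd expect one
# undetected error per 2^32 blocks.
block_size = 4 * 1024                        # bytes (assumed block size)
data_per_missed_error = 2 ** 32 * block_size

print(data_per_missed_error / 2 ** 40)            # => 16.0  (TiB)

# If only 1% of blocks are corrupt, misses are 100x rarer: ~1.6 PiB.
print(data_per_missed_error * 100 / 2 ** 50)      # => 1.5625  (PiB)
```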
Forget about collision resistance. What makes me nervous about CRC32c is that it's slower than xxhash in software, approximately the same speed as blake2 in software, yet results in a much larger 1-in-2^32 chance that data could be corrupt yet the corruption won't be detected.
I'm also not sure what sort of risks and attacks cloud storage providers have to deal with, but AWS S3 for instance computes MD5 hashes for each object. If they or any other storage providers need guarantees about collision and preimage resistance, crc32c, or even xxhash, won't suffice (and MD5 may not, either, but the fact that they haven't run screaming from it yet suggests that they don't use it in a way where its known weaknesses matter).
BLAKE2b is <3.5 cpb on Sandy Bridge, or ~570 MB/s at 2 GHz; ~5 cpb in general on 64-bit x86 (400 MB/s); 6-8 cpb on 32-bit x86 with SSSE3 (250-333 MB/s). It definitely might be overkill for some situations, but it's fairly affordable overkill.
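The cycles-per-byte figures convert to throughput straightforwardly (bytes/s = clock / cpb); a quick check of the numbers above:

```python
def throughput_mb_per_s(clock_hz: float, cycles_per_byte: float) -> float:
    # bytes/second = clock_hz / cycles_per_byte; report in MB/s (10^6 bytes)
    return clock_hz / cycles_per_byte / 1e6

print(round(throughput_mb_per_s(2e9, 3.5)))  # => 571  (~3.5 cpb at 2 GHz)
print(round(throughput_mb_per_s(2e9, 5.0)))  # => 400
print(round(throughput_mb_per_s(2e9, 8.0)))  # => 250
```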
FWIW, BLAKE2 does over 1 gigabyte per second on recent Intel chips. With the parallel modes it should be able to reach 3-4 gigabytes per second with multiple cores.
For non-cryptographic applications that don't require top-grade security, but do want a short reasonably-unique identifier for a file to detect duplicates, MD5's speed and ubiquity made it a natural choice. This is the first thing I've seen aimed at restoring some security to that particular niche.
Until someone gives you two different files with the same md5sum, which are quite easy to create, and the violation of your program's assumptions causes it to misbehave in unexpected ways. Most applications of hash functions probably actually do want cryptographic strength, as they will end up being used in ways you did not foresee (and thus exploited).