Hashes should never be a source of randomness. Randomness makes assumptions far outside their intended use case.
Hashes should only be a reproducible label that cannot be used to produce the material described by the hash. When used for their intended purposes hashes serve as the strongest point of integrity until value collisions are discovered.
But once you've made a function that "cannot be used to produce the material described by the hash", you've also made a very good pseudo-randomizer. In fact, if a cryptographic hash function cannot be trusted for its ability to produce apparent randomness, then it cannot be trusted for its "intended purposes". You get both properties or neither.
There is an untested assumption that hashes achieve randomness because they appear to be a random collection of characters. Hash sequences are completely reproducible given a set of input, and that is by definition not random.
I think you are confusing loss of prediction as randomness. Never in mathematics or logic is that line of thinking correct. This can be described by equivocation, fallacy of composition, inductive fallacy, and more.
I think you are mixing the function itself and it's output, if for a given input to the function the output is uniformly random, then this is a way to derive randomness. The fact that the function itself is deterministic tells you nothing about the distribution of it's output.
You started your comment with a correct distinction, but got the wrong conclusion. Asking if somthing is random is actually a question about the process used to obtain some value and not the value itself. If I ask you if 42 is a random number, can you actually answer? I can get that number with an intrinsically random process based on some quantum effect, or I can say it from the top of my head because I just read a very famous book. You can indeed use an hash function to extract randomness, but, to be precise, we are talking about pseudo-randomness. The crucial difference here is that, if I'm measuring photons' polarization to get a random number, then an attacker repeating the same process will (very likely) obtain a different number. If I'm instead hashing some data, an attacker passing the same data through the same hash function will get the same result. Another example: if I hash the digits of pi, I will get a sequence that will pass statistical tests for randomness. But an attacker that knows how I am generating such sequence can easily reproduce it.
You're saying something being reproducible implies that it is not pseudo-random. This is a definition of pseudo-random that exists only in your head.
The first sentence of the wikipedia entry on pseudo-randomness is:
"A pseudorandom sequence of numbers is one that appears to be statistically random, despite having been produced by a completely deterministic and repeatable process."
That misses a factor called good enough, or degree of predictability. Ultimately everything eventually becomes predictable if analyzed deeply enough. Another word for that is entropy. That is what my linked comment referred to. For security concerns, such as PRNGs, the appearance of randomness is not enough.
This is why security analysis requires a higher threshold than software employment at large.
A hash function cannot create entropy. Let's be clear about that.
A good hash function will preserve entropy, up to the length of its output. If the input X has K bits of entropy and H is an N-bit cryptographic hash function, then the entropy of H(X) is min(K, N). In simpler terms, GIGO.
However, a hash function also scrambles its input, which means the output is indistinguishable from (uniform) random noise. This is the randomizing property I was talking about. It is good enough for hash functions to be used to build stronger primitives, like HMACs, PBKDFs, CSPRNGs, etc. There are many formalizations of this property, but one of the simplest is that given any K bits of the output, you cannot predict the other N-K output bits any better than guessing, even knowing the algorithm.
Of course, if you know the input to a hash function, you can predict the output perfectly. But if you don't know the input, the hash looks like random noise, and for cryptographic hash functions, this is a very strong and fundamental guarantee indeed.
> lol, no. Cryptographic hash functions are specifically designed to achieve this property.
That completely ignores the definition of the word random.
What I find most interesting about this thread of comments is that the article explains the failure of using hashes as a means of randomness and despite that failure people are eager to ignore what hashes are otherwise used for to invent oppositional arguments. it's weird.
It's a lot less weird if you consider the possibility that you don't understand this as well as you think, and the reason people are consistently correcting you is because you are mistaken.
You may want to stay away from all modern CSPRNGs then. Eg Yarrow and Fortuna rely on sources of random input data being mixed in but using a strong hash function (nowadays sha-256) to produce the output at arbitrarily fast rates without consuming entropy.
And to your criticism that this is just programmers who don’t know what they’re doing, these algorithms were developed by Bruce Schneier, Niels Ferguson, and John Kelsey.
I was so frustrated as a noob trying to understand why people were using hashes this way. Even a professional told me "yeah but a collision is really unlikely," and compared it to neutrino interference. How is that supposed to be good enough?
Hash functions are used to instantiate a random oracle (which is a theoretical object that can't be instantiated because it would be of infinite size but makes it easy to reason about) because it doesn't seems crazy as an assumption that if finding a collision between 2 hashes is hard it should be hard to predict the output of the so called hash function. However it is well known that there was some contrive counter example for protocols that are secure under the Random Oracle model and unsecure when instanciated with any hash function. The problem with this paper is that the protocol it described isn't so contrive anymore.
Cryptography is a matter of assumptions and what you believe in or not. You might want to not use random oracle but you will therefore have to restrict yourself in what you can concretely build.
And the reason behind the problem outlined in the paper isn't a biased randomness problem but the fact that you can represent the hash function compared to a RO.
I'm sorry, but this comment is very vague and unclear.
Cryptographers know that hashes (even cryptographically strong ones!) are deterministic. Yet, it is possible that in going from an interactive proof to a non-interactive one, one does not actually need randomness. Indeed, for some class of protocols, we know how to design hash functions satisfying a particular property (correlation intractability) so that the resulting non-interactive proof is sound. It's just that (a) these hashes are inefficient, and (b) until now no one had found a non-contrived protocol where using standard hashes leads to an attack.
You should look into the HyperLogLog algorithm, where fair hash "randomness" is required for the algorithm to work. There are use cases where the pseudo-randomness of hashes is useful, is what I'm trying to say.
This is why you should NEVER trust software developers to make security decisions unless certified to do so. True randomness is challenging to achieve, because computers are inherently predictable. Pseudo-randomness is an intended process to intentionally achieve randomness in spite of this high predictability, often through use of physical or electromagnetic criteria outside the computing machine.
Hash algorithms are none of that. They are not pseudo-randomness merely because a software developer merely wishes them to be so. Hash algorithms are intentionally designed to achieve high reproducibility in that a given set of input should always result in the same hash sequence as output. That intended reproducibility is by definition not random.
You don't understand what pseudo-randomness means. Virtually all PRNGs, even many CSPRNGs, have a way to initialize the algorithm with a seed, and its outputs are fully predictable based on that seed. Choosing a truly random seed, such as one produced by RNG hardware, will lead to a usefully random sequence - but the algorithm is still fully deterministic based on that seed.
>True randomness is challenging to achieve, because computers are inherently predictable
Most modern CPUs now contain a true RNG. They usually use some combination of a metastable latch, or thermal randomness through some kind of analog amplification. Bit strings from this are passed into a pseudorandom number generator to amplify the amount of random data generated.
There probably attacks on this too but it's much harder.
More accurately, the CPU RNG instruction is generally considered untrusted by itself and the only reason it's used is that kernel RNGs are CSPRNGS based on cryptographic hashes (which is what the CS refers to - cryptographically secure) where mixing in a corrupted bit stream along with uncompromised bit streams still results in an uncompromised bit stream out. No one uses the CPU RNG instruction directly (both for security & also secondary perf reasons)
What? You’ve managed to mangle so many terms in so few words… Signatures can refer to two things: integrity checks on a file or authentication checks for a recieved file. In the integrity check situation a hash function (e.g., SHA) is often used. In the authentication check situation, we usually use a public/private keypair for asymmetric encryption; the hash function is only part of the process. The key material used to make this keypair (should) comes from some random number generator…
The ‘hash’ function is a deterministic transform, not a source of randomness.
He is technically not wrong, most signatures can be seen has a public coin interactive proof system where you prove knowledge of a private key.
They are then compiled into an non-interactive proof system via the Fiat-Shamir transform that uses a random oracle concretely instantiated using a hash function (easy to see in Schnorr signature).
So at the end you are using a Hash function to generate your random coin.
Internally, most signature algorithms use hash functions. RSA-PSS, EdDSA and ML-DSA use them to provide something like randomness, and the security analysis of those signature schemes includes arguments assuming (in some very particular, technical ways) that the hash function outputs "look random".
Classical DSA and ECDSA do not use hash functions this way, but in my opinion they aren't stronger for it: they're basically assuming instead that some other mathematical function "looks random", which seems riskier than assuming that about a hash function. I've heard that the reason for this is to get around Schnorr's patent on doing it with hash functions, which has since expired.
The SHA3 and SHAKE hash functions (underlying e.g. ML-DSA) are explicitly designed to "look random" as well.
There are some signature schemes that try not to make such strong assumptions: in particular SLH-DSA targets properties more like first- and second-preimage resistance, target-collision-resistance, and so on.
All the algorithms you mention are PKI. RSA uses two large prime numbers. I don't see what hash sequences have to do with this at all.
PKI isn't even really about randomness. RSA does use a kind of randomness to generate its large primes, but that is beneficial and not required. The primary consideration is the math to reverse guess a factor of two primes or the square root of a large number, or something else computers currently find cheap to compute in one way but extremely expensive to reverse.
The intro textbook descriptions of cryptographic systems omit a lot of very important details.
When using RSA to sign a message m, in practice you don't send m^d mod N. That would generally be insecure, depending on what kinds of messages your system sends and/or accepts. In practical systems, instead you hash m, and then adjust the hash through a (possibly randomized) process called "padding" to be a value in [0,N). There are different standards for padding, and the better designs use additional hashing.
The security of the system depends in part on the hashed-then-padded message "looking random", i.e. not having structure that can be exploited by an attacker. It turns out to be tricky to formalize what exact randomness property you need, so cryptosystems are often analyzed in the "random oracle model" (ROM) in which the hash function has impossibly strong randomness properties.
It seems that usually, if you use a strong hash function, a scheme that's proved secure in the ROM is secure in real life (or at least it's not the ROM part that breaks); the counterexamples are usually really contrived. This article is about a somewhat-less-contrived, but still not quite realistic, example where something that's secure in the ROM would break due to the ROM being an unrealistic model.
It's not wrong. The only thing preventing me from forging your certificate is my inability to generate a new cert which hashes to the same digest as what is in your cert's signature. I don't actually need the keys if I break the hash.
EDIT2: I'm doing a bad job of explaining this... you obviously need the keypair associated with the cert to initiate connections with it and not trigger MITM alerts. But if you break the hash function, you don't need the private key from the root cert, the verbatim signature from the original cert will appear to be valid when spliced into your forged cert if the hash digest computation on the forged cert is the same.
And checking the validity of a certificate involves checking a signature of... The certificate's hash. If you can break the underlying hash function, then the trust chain is broken.
Hashes should only be a reproducible label that cannot be used to produce the material described by the hash. When used for their intended purposes hashes serve as the strongest point of integrity until value collisions are discovered.