Hash Collision Considerations
In computing theory Collision is described as the event where two objects or records receive the same identifier or allocation in memory. The description also applies when one algorithm produces the same hash value for different entries. An algorithm considered robust must have as fundamental characteristic the “pairwise independence” , which in a simplified way means that the hash generated from two distinct entries must be different.
The hash functions do not have methods that allow the identification of the input items, or registration of the hash values generated from them. This means that although the pairwise independence feature is desirable and extremely necessary, collision is statistically possible since there are unlimited inputs and a limited (although incredibly big) number of possible outputs. The probability of collision is directly related to the composition (characters) and the size of the hash value. The greater the size of the hash value, the more difficult becomes the collision. The use of the word difficult and not impossible is adequate because the growth of computing power makes the calculations necessary to find these values more accessible. To illustrate how the collision works, consider a hash function represented by H, and any input value, for example, “Hello”, which as a result generates a hash value, 7878654, consisting of 7 (seven) characters numbers. Having the equivalent function:
H(Hello) = 7878654
The collision could happen when, for the same hash function H and a different entry, for example, “World”, the result generated is the same as that obtained for the entry “Hello” .
H(World) = 7878654
Hashing collision considerations become extremely important when storing hash values is required. Another value is necessary to make the calculation of collision possible, this being the size of the database to be used.
The problem of collision can be illustrated as follows. Being N the total value of outputs or possible hash values, and k a random number of generated values, which are numeric and less than N. What is the probability that at least two of the multiple k values are the same? Or rather, how likely are all values to be unique?
The mathematical abstraction of this question can be summed up only by subtracting the value of N, by 1. That is, given a possible output space or hash values (can be any integer value), there are values that are unique and different from the first item in the list. Therefore, the probability of generating two equal integers is: (N – 1) / N
After this, there will be possible N – 2 values that are unique and different from the first two items in the list. This means that the probability of generating 3 equal integer values is: (N – 1) / N x (N – 2) / N
Because hash algorithms produce random values within space N, which are called k, it is possible to calculate the probability of having only unique values, as follows: e(-k(k-1)/2N
By subtracting the result of the function by 1, we get the answer to the original question – “What is the probability that all values are unique?” which is the same as obtaining the probability of collision having then:
1 – (e(-k(k-1)/2N)
Large-scale calculations are performed by computers and even a relatively small 32-bit hash represent a universe of 4.294.967.296 (4.2 billion) possible entries. The python code below shows how to implement collision calculation.
#!/usr/bin/python import threading probUnique = 1.0 #Constant - probability of having two equal numbers def domath(n,k): t = threading.Thread(target=collision, args=(n,k)) t.start() def collision(N,k): global probUnique probUnique = probUnique * (N - (k - 1)) / N probability = probUnique - 1 probPercentage = abs(round(probability * 100,2)) print('Registro %s tem %s ou %s %% chances de colisao' % (k, probability, probPercentage)) N = 18446744073709551616 #Universe of possible hashes = 2^64 or 16^16 for k in xrange(1, 5000000000): #5Billion domath(N,k)
After applying the algorithm and testing the probability of collision, it is possible to state that regardless of the size of the hash, the probability for a collision reaches 50% when dividing the exponent by 2. Thus, the probability of having two equal numbers or entries in a database with 256 (2^8) positions reaches 50% by having about 16 (2^4) generated values, which is exactly half (8/2) the exponent of 256. Cool huh?! So, when thinking about database size and possibility of collision, try to size it based on half of the hash total space.