Hashing

Hashing is the act of reducing a set of possible values to a smaller subset of possible values using a mapping produced by some function. The function is referred to as a hash function, and its output is referred to as a hash code. The associative array that maps hash codes to their values is known as a hash table. A simple example of a hash function is the modulus operation.
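To make the terms concrete, here is a minimal sketch of a hash table built on a modulus hash function. The names (`hash_code`, `put`, `get`) are illustrative, not a standard API; collisions are handled by chaining each bucket as a list.

```python
def hash_code(key: int, num_buckets: int) -> int:
    """Division-method hash: reduce any integer key to [0, num_buckets)."""
    return key % num_buckets

table = [[] for _ in range(8)]          # 8 buckets, each chained as a list

def put(key: int, value: str) -> None:
    table[hash_code(key, len(table))].append((key, value))

def get(key: int):
    for k, v in table[hash_code(key, len(table))]:
        if k == key:
            return v
    return None

put(42, "answer")
print(get(42))   # "answer" -- key 42 lands in bucket 42 % 8 == 2
```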

Consider the following example:

A 64-bit architecture can theoretically handle $2^{64}$ addresses (whether it actually does depends on the operating system). In most cases, systems are not dealing with 18 exabytes of random access memory (RAM). In this case we want to map an address outside of our system's available memory to an address that is inside our available memory space. Say the system has 8 gigabytes of RAM (8 GB). If we wanted to map the address at 10 GB, we need only apply the division hash function as follows:

$$ 1\qquad h(k) = k \bmod m $$

Where $h$ is a hash function, $k$ is a key to map to some subset of the universal space, and $m$ is the value that divides $k$. This hash function returns the remainder of the division, which is some value greater than or equal to zero (0) and strictly less than $m$.

So, in applying the division hash function to our example, that looks like:

$h(10\ GB) = 10\ GB \bmod 8\ GB$

Which is equivalent to:

$h(10 \cdot 2^{30}) = (10 \cdot 2^{30}) \bmod (8 \cdot 2^{30})$

Written out:

$10 \div 8 = 1$, remainder $2$; so $(10 \cdot 2^{30}) \bmod (8 \cdot 2^{30}) = 2 \cdot 2^{30}$

Thus, $h(10\ GB)$ maps to the 2 GB address in an 8 GB address space. Of course, the operating system handles most of this for you. Whether that is good or bad, it's how things are. Most programmers work in a virtual address space that is theoretically limitless, and the OS decides how to deal with those addresses when the program is executed.
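The worked example above is one line of code. This sketch (names like `division_hash` are my own, chosen for illustration) applies equation 1 to the 10 GB address:

```python
GiB = 2 ** 30                     # one gibibyte, in bytes

def division_hash(k: int, m: int) -> int:
    """Equation 1: h(k) = k mod m."""
    return k % m

address = 10 * GiB                # the 10 GB address
memory = 8 * GiB                  # the 8 GB address space
print(division_hash(address, memory) // GiB)   # 2 -> the 2 GB address
```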

The remainder of this note is dedicated to further discussion of hashing, now that you have a basic understanding of what hashing is and its intended purpose.


Non-Deterministic Hashing

When considering which hash function to use in a particular situation, we want the one that best restricts the subset of possible values with the least likelihood that there will be two distinct keys $k_i \neq k_j$ such that the hash of $k_i$ is equivalent to the hash of $k_j$. This issue is known as a hash collision, and there will be some discussion of ways to minimize its likelihood and techniques for dealing with it when it does occur.

We begin by developing a function generator that can be used to create a set of hash functions that we can use. The function generator that follows is known as a Universal Hash Function:

$$ 2\qquad h_{ab}(k) = ((ak+b) \bmod p) \bmod m \\ 3\qquad H(p,m) = \{h_{ab} \ | \ a, b \in \{0, \dots, p - 1\} \ \& \ a \neq 0\} $$

Equation 2 is a generalization of equation 1: the key $k$ is scaled by a coefficient $a$ and offset by a coefficient $b$ before performing the division by some value $p$, a prime number greater than $a$, $b$, and any key to be hashed; the remainder of that division is then divided by the value $m$, the size of the subspace's range, and the final remainder is the hash code. Equation 3 generalizes equation 2 into a family of functions: given the value $p$ for the first mod and the value $m$ for the second mod, $m$ being the subspace, $H(p, m)$ contains one function $h_{ab}$ for each choice of $a$ and $b$ in the range $[0, p-1]$ with $a \neq 0$.
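Choosing a random member of the family amounts to drawing random coefficients $a$ and $b$. A minimal sketch of equations 2 and 3, assuming the caller supplies a prime $p$ larger than any key (the function name is my own):

```python
import random

def make_universal_hash(p: int, m: int):
    """Draw one h_ab from H(p, m): a in [1, p-1], b in [0, p-1].
    p must be a prime larger than any key that will be hashed."""
    a = random.randrange(1, p)          # a != 0, per equation 3
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m   # equation 2

h = make_universal_hash(p=2**31 - 1, m=8)   # 2^31 - 1 is a Mersenne prime
print(h(10 * 2**30) in range(8))            # every hash code lands in [0, m)
```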

Proof of Universality

Let us prove that this function generator provides a set of hash functions such that, for a function drawn at random from the set, any two distinct keys are at most $1 \over m$ likely to collide. This is the universal property of the universal hash function generator. To prove this we will have to ensure the following is true:

$ P_{h \in H}\ \{h(k_i) = h(k_j)\} \leq {1\over m} \qquad \forall\quad k_i \neq k_j \in \{0, \dots, n-1\} $

This means that the probability that two distinct keys, hashed by a function generated with equations 2 and 3, receive the same hash code is less than or equal to 1 divided by the number of values in the subspace. This ensures that the function's outputs are well distributed across the subspace, so the number of possible collisions is minimal.

Test of Proof

In order to test this proof, we will use an indicator random variable: a variable that takes the value 1 if the event of interest occurs and 0 otherwise. Taking expectations of indicator variables lets us count the expected number of collisions directly.

Let the random variable be $X_{ij}$, indicating a collision between keys $k_i$ and $k_j$ for a given choice of $h \in H$: $X_{ij} = 1$ if $h(k_i) = h(k_j)$, and 0 otherwise. Also, let the number of keys that share the hash code of $k_i$ be as follows:

$$ X_i = \sum_{j=0}^{n-1}X_{ij} $$

Our expectation is that the likelihood of $X_{ij} = 1$ is at most ${1 \over m}$ for any $i \neq j$.

See that:

$$ E_{h \in H} \{X_i\} = E_{h \in H}\ \Big\{\sum_j X_{ij}\Big\} \\ = \Big(\sum_{j \neq i} E\{X_{ij}\}\Big) + 1 \\ \leq \Big(\sum_{j \neq i} {1 \over m}\Big) + 1 \\ = 1 + {n - 1 \over m}\qquad . $$

All of that is basically to say: if the hash function generator is truly "universal," then any randomly chosen function from the family distributes hash codes evenly across the subspace, so the expected number of keys sharing any given hash code is $1 + {n - 1 \over m}$, which is a constant whenever the number of keys $n$ is proportional to the subspace size $m$.
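The bound can also be checked empirically: draw many random functions from the family and count how often a fixed pair of keys collides. The observed rate should stay at or below ${1 \over m}$. A sketch under the same assumptions as before (the prime $p = 101$ and keys 7 and 55 are arbitrary choices for illustration):

```python
import random

def make_universal_hash(p, m):
    """One random h_ab from H(p, m), as in equations 2 and 3."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

# Estimate Pr[h(k_i) = h(k_j)] over many independent draws of h.
p, m = 101, 10
k_i, k_j = 7, 55
trials = 20000
collisions = sum(
    1 for _ in range(trials)
    if (h := make_universal_hash(p, m))(k_i) == h(k_j)
)
print(collisions / trials)   # observed rate, at most about 1/m = 0.1
```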