1 parent
b75c360
commit a9dfe15
Showing
8 changed files
with
796 additions
and
0 deletions.
@@ -0,0 +1,76 @@

# Introduction

A hash table is a data structure which offers a fast implementation of the
associative array [API](#api). As the terminology around hash tables can be
confusing, I've added a summary [below](#terminology).

A hash table consists of an array of 'buckets', each of which stores a key-value
pair. In order to locate the bucket where a key-value pair should be stored, the
key is passed through a hashing function. This function returns an integer which
is used as the pair's index in the array of buckets. When we want to retrieve a
key-value pair, we supply the key to the same hashing function, receive its
index, and use the index to find it in the array.

Array indexing has algorithmic complexity `O(1)`, making hash tables fast at
storing and retrieving data.
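As a rough illustration of the lookup path described above, the sketch below stores values in a fixed array and finds them again by hashing the key. The names `toy_hash`, `toy_put`, `toy_get` and `TOY_BUCKETS` are hypothetical and not part of this tutorial's code. The hashing function is a deliberately simple stand-in, and collisions are ignored: a later key that lands in an occupied bucket just overwrites it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 53  /* assumed bucket count for this sketch */

/* One bucket holds at most one key-value pair; a NULL key marks it empty. */
typedef struct {
    const char* key;
    const char* value;
} toy_bucket;

static toy_bucket toy_table[TOY_BUCKETS];

/* Hypothetical hashing function: sum the character codes, then reduce
   the sum to a valid index into the bucket array. */
static size_t toy_hash(const char* key) {
    size_t sum = 0;
    for (size_t i = 0; key[i] != '\0'; i++) {
        sum += (unsigned char)key[i];
    }
    return sum % TOY_BUCKETS;
}

/* Store: hash the key, then index straight into the array.
   The strings are not copied, so they must outlive the table. */
void toy_put(const char* key, const char* value) {
    assert(key != NULL && value != NULL);
    const size_t index = toy_hash(key);
    toy_table[index].key = key;
    toy_table[index].value = value;
}

/* Retrieve: same hashing function, same index, one array access. */
const char* toy_get(const char* key) {
    assert(key != NULL);
    const toy_bucket* b = &toy_table[toy_hash(key)];
    if (b->key != NULL && strcmp(b->key, key) == 0) {
        return b->value;
    }
    return NULL;
}
```

Neither `toy_put` nor `toy_get` loops over the table, which is where the `O(1)` behaviour comes from. The cost of hashing itself grows with the length of the key, not with the number of stored items.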

Our hash table will map string keys to string values, but the principles
given here are applicable to hash tables which map arbitrary key types to
arbitrary value types. Only ASCII strings will be supported, as supporting
Unicode is non-trivial and out of scope of this tutorial.

## API

An associative array is a collection of unordered key-value pairs. Duplicate keys
are not permitted. The following operations are supported:

- `search(a, k)`: return the value `v` associated with key `k` from the
  associative array `a`, or `NULL` if the key does not exist.
- `insert(a, k, v)`: store the pair `k:v` in the associative array `a`.
- `delete(a, k)`: delete the `k:v` pair associated with `k`, or do nothing if
  `k` does not exist.

## Setup

To set up C on your computer, please consult [Daniel Holden's](@orangeduck)
guide in the [Build Your Own
Lisp](http://www.buildyourownlisp.com/chapter2_installation) book. Build Your
Own Lisp is a great book, and I recommend working through it.

## Code structure

Code should be laid out in the following directory structure.

```
.
├── build
└── src
    ├── hash_table.c
    ├── hash_table.h
    ├── prime.c
    └── prime.h
```

`src` will contain our code, `build` will contain our compiled binaries.

## Terminology

There are lots of names which are used interchangeably. In this article, we'll
use the following:

- Associative array: an abstract data structure which implements the
  [API](#api) described above. Also called a map, symbol table or
  dictionary.

- Hash table: a fast implementation of the associative array API which makes
  use of a hash function. Also called a hash map, map, hash or
  dictionary.

Associative arrays can be implemented with many different underlying data
structures. A (non-performant) one can be implemented by simply storing items in
an array, and iterating through the array when searching. Associative arrays and
hash tables are often confused because associative arrays are so often
implemented as hash tables.
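To make the distinction concrete, here is a minimal sketch of the non-performant array-backed associative array mentioned above. It implements the same `search`, `insert` and `delete` operations from the [API](#api) section without any hashing. All names (`naive_map`, `naive_search`, `naive_insert`, `naive_delete`, `NAIVE_CAPACITY`) are hypothetical, and the fixed capacity is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAIVE_CAPACITY 16  /* assumed fixed capacity for this sketch */

typedef struct {
    const char* key;
    const char* value;
} naive_pair;

typedef struct {
    naive_pair pairs[NAIVE_CAPACITY];
    size_t count;  /* number of pairs currently stored */
} naive_map;

/* search: linear scan, so up to `count` string comparisons per lookup. */
const char* naive_search(const naive_map* m, const char* k) {
    assert(m != NULL && k != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->pairs[i].key, k) == 0) {
            return m->pairs[i].value;
        }
    }
    return NULL;
}

/* insert: overwrite an existing key (duplicates are not permitted),
   otherwise append. Returns 0 on success, -1 if the array is full. */
int naive_insert(naive_map* m, const char* k, const char* v) {
    assert(m != NULL && k != NULL && v != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->pairs[i].key, k) == 0) {
            m->pairs[i].value = v;
            return 0;
        }
    }
    if (m->count == NAIVE_CAPACITY) {
        return -1;
    }
    m->pairs[m->count].key = k;
    m->pairs[m->count].value = v;
    m->count++;
    return 0;
}

/* delete: move the last pair into the gap. Pairs are unordered, so
   this is allowed. Does nothing if the key does not exist. */
void naive_delete(naive_map* m, const char* k) {
    assert(m != NULL && k != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (strcmp(m->pairs[i].key, k) == 0) {
            m->pairs[i] = m->pairs[m->count - 1];
            m->count--;
            return;
        }
    }
}
```

Every operation here walks the array, so the cost grows with the number of stored pairs. A hash table offers the same three operations but replaces the walk with a computed index.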

Next section: [Hash table structure](/hash-table)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

@@ -0,0 +1,104 @@

# Hash table structure

Our key-value pairs (items) will each be stored in a `struct`:

```c
// hash_table.h
typedef struct ht_item {
    char* key;
    char* value;
} ht_item;
```

Our hash table stores an array of pointers to items, and some details about its
size and how full it is:

```c
// hash_table.h
typedef struct {
    int size;
    int count;
    ht_item** items;
} ht_hash_table;
```

## Initialising and deleting

We need to define an initialisation function for `ht_item`s. This function
allocates a chunk of memory the size of an `ht_item`, and stores in it pointers
to freshly allocated copies of the strings `k` and `v`. The function is marked as
`static` because it will only ever be called by code internal to the hash table.

```c
// hash_table.c
#include <stdlib.h>
#include <string.h>

#include "hash_table.h"

static ht_item* ht_new_item(const char* k, const char* v) {
    ht_item* i = malloc(sizeof(ht_item));
    // strdup (POSIX, standard C from C23) allocates a copy of each string
    i->key = strdup(k);
    i->value = strdup(v);
    return i;
}
```

`ht_new` initialises a new hash table. `size` defines how many items we can
store. This is fixed at 53 for now. We'll expand this in the section on
[resizing](/resizing). We initialise the array of items with `calloc`, which
fills the allocated memory with zero bytes, so every entry starts out as `NULL`
on mainstream platforms. A `NULL` entry in the array indicates that the bucket
is empty.

```c
// hash_table.c
ht_hash_table* ht_new() {
    ht_hash_table* ht = malloc(sizeof(ht_hash_table));
    ht->size = 53;
    ht->count = 0;
    ht->items = calloc((size_t)ht->size, sizeof(ht_item*));
    return ht;
}
```

We also need functions for deleting `ht_item`s and `ht_hash_table`s, which
`free` the memory we've allocated, so we don't cause [memory
leaks](https://en.wikipedia.org/wiki/Memory_leak).

```c
// hash_table.c
static void ht_del_item(ht_item* i) {
    free(i->key);
    free(i->value);
    free(i);
}


void ht_del_hash_table(ht_hash_table* ht) {
    for (int i = 0; i < ht->size; i++) {
        ht_item* item = ht->items[i];
        if (item != NULL) {
            ht_del_item(item);
        }
    }
    free(ht->items);
    free(ht);
}
```

We have written code which defines a hash table, and lets us create and destroy
one. Although it doesn't do much at this point, we can still try it out. This
assumes `hash_table.h` also declares `ht_new` and `ht_del_hash_table`.

```c
// main.c
#include "hash_table.h"

int main() {
    ht_hash_table* ht = ht_new();
    ht_del_hash_table(ht);
}
```

Next section: [Hash functions](/hashing)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

@@ -0,0 +1,93 @@

# Hash function

In this section, we'll write our hash function.

The hash function we choose should:

- Take a string as its input and return a number between `0` and `m - 1`, where
  `m` is our desired bucket array length.
- Return an even distribution of bucket indexes for an average set of inputs. If
  our hash function is unevenly distributed, it will put more items in some
  buckets than others. This will lead to a higher rate of
  [collisions](#collisions). Collisions reduce the efficiency of our hash table.

## Algorithm

We'll make use of a generic string hashing function, expressed below in
pseudocode.

```
function hash(string, a, num_buckets):
    hash = 0
    string_len = length(string)
    for i = 0, 1, ..., string_len - 1:
        hash += (a ** (string_len - (i+1))) * char_code(string[i])
    hash = hash % num_buckets
    return hash
```

This hash function has two steps:

1. Convert the string to a large integer
2. Reduce the size of the integer to a fixed range by taking its remainder `mod`
   `m` (`num_buckets` in the pseudocode)

The variable `a` should be a prime number larger than the size of the alphabet.
We're hashing ASCII strings, which have an alphabet size of 128, so we should
choose a prime larger than that.

`char_code` is a function which returns an integer which represents the
character. We'll use ASCII character codes for this.

Let's try the hash function out:

```
hash("cat", 151, 53)
hash = (151**2 * 99 + 151**1 * 97 + 151**0 * 116) % 53
hash = (2257299 + 14647 + 116) % 53
hash = 2272062 % 53
hash = 5
```
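The arithmetic above can be double-checked with a few lines of C. `poly_hash_check` is a hypothetical helper written only for this check, not the tutorial's `ht_hash`. It follows the pseudocode literally with plain integer multiplication, so it is only suitable for short strings and small values of `a`, where the intermediate sums cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Computes (sum over i of a^(len - (i+1)) * s[i]) % m, step for step
   as in the pseudocode. Only safe for short strings: the unreduced
   total must fit in a long long. */
long long poly_hash_check(const char* s, long long a, long long m) {
    assert(s != NULL && m > 0);
    const size_t len = strlen(s);
    long long total = 0;
    for (size_t i = 0; i < len; i++) {
        /* power = a ** (len - (i+1)) */
        long long power = 1;
        for (size_t j = 0; j < len - (i + 1); j++) {
            power *= a;
        }
        total += power * (unsigned char)s[i];
    }
    return total % m;
}
```

Calling `poly_hash_check("cat", 151, 53)` builds the same total of 2272062 and returns 5.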

Changing the value of `a` gives us a different hash function.

```
hash("cat", 163, 53) = 21
```

## Implementation

```c
// hash_table.c
#include <math.h>  // pow

static int ht_hash(const char* s, const int a, const int m) {
    long hash = 0;
    const int len_s = (int)strlen(s);
    for (int i = 0; i < len_s; i++) {
        // pow returns a double; exact only while a^(len_s - (i+1)) stays small
        hash += (long)pow(a, len_s - (i+1)) * s[i];
        // reducing on every iteration gives the same remainder as reducing
        // once after the loop, and keeps the running total small
        hash = hash % m;
    }
    return (int)hash;
}
```

## Pathological data

An ideal hash function would always return an even distribution. However, for
any hash function, there is a 'pathological' set of inputs, which all hash to
the same value. To find this set of inputs, run a large set of inputs through
the function. All inputs which hash to a particular bucket form a pathological
set.

The existence of pathological input sets means there are no perfect hash
functions for all inputs. The best we can do is to create a function which
performs well for the expected data set.

Pathological inputs also pose a security issue. If a hash table is fed a set of
colliding keys by some malicious user, then searches for those keys will take
much longer (`O(n)`) than normal (`O(1)`). This can be used as a denial of
service attack against systems which are underpinned by hash tables, such as DNS
and certain web services.

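The search for a pathological set described above can be sketched as follows. The function runs a list of candidate keys through a hash function and records which of them fall into one chosen bucket. `sketch_hash` and `collect_colliding` are hypothetical names. `sketch_hash` is a stand-in polynomial string hash written for this example, so substitute the hash function you actually want to probe.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in string hash for this sketch: a polynomial in `a`, reduced
   mod `m` at every step. */
static int sketch_hash(const char* s, int a, int m) {
    long long hash = 0;
    for (; *s != '\0'; s++) {
        hash = (hash * a + (unsigned char)*s) % m;
    }
    return (int)hash;
}

/* Writes the positions of all keys that hash to `bucket` into `out`
   (which must have room for `n_keys` entries) and returns how many
   were found. The keys at those positions form a colliding set. */
size_t collect_colliding(const char* const* keys, size_t n_keys,
                         int a, int m, int bucket, size_t* out) {
    assert(keys != NULL && out != NULL && m > 0);
    size_t found = 0;
    for (size_t i = 0; i < n_keys; i++) {
        if (sketch_hash(keys[i], a, m) == bucket) {
            out[found++] = i;
        }
    }
    return found;
}
```

With `a = 151` and `m = 53`, the one-character keys `"A"` and `"v"` end up in the same bucket, because their character codes (65 and 118) differ by exactly 53. An attacker who can generate many such keys can push every lookup towards the `O(n)` case.
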
Next section: [Handling collisions](/collisions) | ||
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents) |
@@ -0,0 +1,50 @@

# Handling collisions

Hash functions map an infinitely large number of inputs to a finite number of
outputs. Different input keys will map to the same array index, causing
bucket collisions. Hash tables must implement some method of dealing with
collisions.

Our hash table will handle collisions using a technique called open addressing
with double hashing. Double hashing makes use of two hash functions to
calculate the index an item should be stored at after `i` collisions.

For an overview of other types of collision resolution, see the
[appendix](/07-appendix).

## Double hashing

The index that should be used after `i` collisions is given by:

```
index = (hash_a(string) + i * hash_b(string)) % num_buckets
```

We see that if no collisions have occurred, `i = 0`, so the index is just
`hash_a` of the string. If a collision happens, the index is offset by `i`
multiples of `hash_b`.

It is possible that `hash_b` will return 0, reducing the second term to 0. This
will cause the hash table to try to insert the item into the same bucket over
and over. We can mitigate this by adding 1 to the result of the second hash,
making sure it's never 0.

```
index = (hash_a(string) + i * (hash_b(string) + 1)) % num_buckets
```

## Implementation

```c
// hash_table.c
// HT_PRIME_1 and HT_PRIME_2 are assumed to be defined elsewhere as two
// different primes larger than 128 (for example 151 and 163).
static int ht_get_hash(
    const char* s, const int num_buckets, const int attempt
) {
    const int hash_a = ht_hash(s, HT_PRIME_1, num_buckets);
    const int hash_b = ht_hash(s, HT_PRIME_2, num_buckets);
    return (hash_a + (attempt * (hash_b + 1))) % num_buckets;
}
```
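To see how the probe index moves as collisions accumulate, the sketch below applies the same formula to two already-computed hash values. `probe_sequence` is a hypothetical helper, not part of the tutorial's code, and the sample inputs used in the note after it are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Fills out[0..n-1] with the bucket index tried on each attempt,
   using index = (hash_a + attempt * (hash_b + 1)) % num_buckets. */
void probe_sequence(int hash_a, int hash_b, int num_buckets,
                    int n, int* out) {
    assert(out != NULL && num_buckets > 0);
    assert(hash_a >= 0 && hash_b >= 0);
    for (int attempt = 0; attempt < n; attempt++) {
        out[attempt] = (hash_a + attempt * (hash_b + 1)) % num_buckets;
    }
}
```

With `hash_a = 5`, `hash_b = 21` and 53 buckets, the first four indexes are 5, 27, 49 and 18. With `hash_b = 0` the `+ 1` turns the sequence into 5, 6, 7 instead of 5, 5, 5. One case the `+ 1` does not cover: when `hash_b + 1` equals `num_buckets` (here `hash_b = 52`), the step is a multiple of the table size and every attempt lands on the same bucket again.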

Next section: [Hash table methods](/methods)
[Table of contents](https://github.com/jamesroutley/write-a-hash-table#contents)

Commit comment on the line `hash = hash % m;` in `ht_hash`: From the pseudocode written before, this should be outside the `for` loop.