Hashtable




 
Hash table data structure using the C programming language. I briefly demonstrate linear and binary search, and then design and implement a hash table. My goal is to show that hash table internals are not scary, but – within certain constraints – are easy enough to build from scratch.
Go to: Linear search | Binary search | Hash tables | Implementation | Discussion
 simple program that counts word frequencies across various languages, and one of the things that came up was how C doesn’t have a hash table data structure in its standard library.


There are many things you can do when you realize this: use linear search, use binary search, grab someone else’s hash table implementation, or write your own hash table. Or switch to a richer language. We’re going to take a quick look at linear and binary search, and then learn how to write our own hash table. This is often necessary in C, but it can also be useful if you need a custom hash table when using another language.

Linear search
The simplest option is to use linear search to scan through an array. This is actually not a bad strategy if you’ve only got a few items – in my simple comparison using strings, it’s faster than a hash table lookup up to about 7 items (but unless your program is very performance-sensitive, it’s probably fine up to 20 or 30 items). Linear search also allows you to append new items to the end of the array. With this type of search you’re comparing an average of num_keys/2 items.

Let’s say you’re searching for the key bob in the following array (each item is a string key with an associated integer value):

Index 0 1 2 3 4 5 6
Key foo bar bazz buzz bob jane x
Value 10 42 36 7 11 100 200
You simply start at the beginning (foo at index 0) and compare each key. If the key matches what you’re looking for, you’re done. If not, you move to the next slot. Searching for bob takes five steps (indexes 0 through 4).

Here is the algorithm in C (assuming each array item is a string key and integer value):

typedef struct {
    char* key;
    int value;
} item;

item* linear_search(item* items, size_t size, const char* key) {
    for (size_t i=0; i<size; i++) {
        if (strcmp(items[i].key, key) == 0) {
            return &items[i];
        }
    }
    return NULL;
}

int main(void) {
    item items[] = {
        {"foo", 10}, {"bar", 42}, {"bazz", 36}, {"buzz", 7},
        {"bob", 11}, {"jane", 100}, {"x", 200}};
    size_t num_items = sizeof(items) / sizeof(item);

    item* found = linear_search(items, num_items, "bob");
    if (!found) {
        return 1;
    }
    printf("linear_search: value of 'bob' is %d\n", found->value);
    return 0;
}
Binary search
Another simple approach is to put the items in an array which is sorted by key, and use binary search to reduce the number of comparisons. This is kind of how we might look something up in a (paper) dictionary.

C even has a bsearch function in its standard library. Binary search is reasonably fast even for hundreds of items (though not as fast as a hash table), because you’re only comparing an average of log(num_keys) items. However, because the array needs to stay sorted, you can’t insert items without copying the rest down, so insertions still require an average of num_keys/2 operations.

Assume we’re looking up bob again (in this pre-sorted array):

Index 0 1 2 3 4 5 6
Key bar bazz bob buzz foo jane x
Value 42 36 11 7 10 100 200
With binary search, we start in the middle (buzz), and if the key there is greater than what we’re looking for, we repeat the process with the lower half. If it’s greater, we repeat the process with the higher half. In this case it results in three steps, at indexes 3, 1, 2, and then we have it. This is 3 steps instead of 5, and the improvement over linear search gets (exponentially) better the more items you have.

Here’s how you’d do it in C (with and withoutbsearch). The definition of the item struct is the same as above.

int cmp(const void* a, const void* b) {
    item* item_a = (item*)a;
    item* item_b = (item*)b;
    return strcmp(item_a->key, item_b->key);
}

item* binary_search(item* items, size_t size, const char* key) {
    if (size + size < size) {
        return NULL; // size too big; avoid overflow
    }
    size_t low = 0;
    size_t high = size;
    while (low < high) {
        size_t mid = (low + high) / 2;
        int c = strcmp(items[mid].key, key);
        if (c == 0) {
            return &items[mid];
        }
        if (c < 0) {
            low = mid + 1; // eliminate low half of array
        } else {
            high = mid; // eliminate high half of array
        }
    }
    // Entire array has been eliminated, key not found.
    return NULL;
}

int main(void) {
    item items[] = {
        {"bar", 42}, {"bazz", 36}, {"bob", 11}, {"buzz", 7},
        {"foo", 10}, {"jane", 100}, {"x", 200}};
    size_t num_items = sizeof(items) / sizeof(item);

    item key = {"bob", 0};
    item* found = bsearch(&key, items, num_items, sizeof(item), cmp);
    if (found == NULL) {
        return 1;
    }
    printf("bsearch: value of 'bob' is %d\n", found->value);

    found = binary_search(items, num_items, "bob");
    if (found == NULL) {
        return 1;
    }
    printf("binary_search: value of 'bob' is %d\n", found->value);
    return 0;
}

Posted on by