C语言哈希表设计与碰撞解决算法对比

哈希表基础概念

什么是哈希表

哈希表（Hash Table），也叫散列表，是根据关键码值（Key-Value）而直接进行访问的数据结构。它通过一个哈希函数（Hash Function），将键映射到一个表中的索引位置，从而可以在接近常数时间复杂度内实现数据的插入、查找和删除操作。

在C语言中，我们可以通过数组和链表等数据结构来构建哈希表。哈希表的基本思想是：给定一个记录的关键字 key，通过一个哈希函数 h(key) 计算出该记录在哈希表中的存储位置。例如，假设有一个简单的哈希函数 h(key) = key % 10，对于关键字 23，通过该哈希函数计算得到的位置为 23 % 10 = 3，那么这个记录就可以存储在哈希表数组的第 3 个位置。

哈希函数的设计原则

一致性：相同的输入总是产生相同的输出。如果对于同一个 key，哈希函数每次计算出来的哈希值都不一样，那么哈希表就无法正常工作。例如，无论何时计算 h(10)，它都应该返回相同的值。
高效性：计算哈希值的过程应该尽量高效，不应该有过于复杂的计算逻辑，以免影响哈希表的整体性能。简单的取模运算（如 key % size）就是一种高效的哈希函数计算方式，这里 size 是哈希表的大小。
均匀分布：理想情况下，哈希函数应该将不同的 key 均匀地映射到哈希表的各个位置，减少碰撞的发生。例如，如果所有的 key 经过哈希函数计算后都集中在哈希表的某几个位置，那么这些位置就会频繁发生碰撞，降低哈希表的性能。

下面是一个简单的哈希函数示例：

// 简单的哈希函数，将字符串转换为一个数值，再进行取模
unsigned int hash_function(const char *str, int table_size) {
    unsigned int hash = 0;
    while (*str) {
        hash = (hash << 5) + hash + *str++;
    }
    return hash % table_size;
}

这个哈希函数首先将字符串通过位运算和加法转换为一个数值，然后对哈希表大小取模，得到最终的哈希值。

哈希表设计

哈希表的数据结构定义

在C语言中，我们通常使用结构体来定义哈希表。一个简单的哈希表结构体可能包含一个数组，数组的每个元素指向一个链表，用于处理碰撞。如下所示：

// 定义链表节点
typedef struct HashNode {
    char *key;
    int value;
    struct HashNode *next;
} HashNode;

// 定义哈希表
typedef struct HashTable {
    HashNode **table;
    int size;
} HashTable;

这里，HashNode 结构体定义了链表节点，每个节点包含一个键（key）、一个值（value）和一个指向下一个节点的指针（next）。HashTable 结构体包含一个指向 HashNode 指针数组的指针（table）和哈希表的大小（size）。

哈希表的初始化

初始化哈希表时，我们需要分配内存给 table 数组，并将每个数组元素初始化为 NULL。

HashTable* create_hash_table(int size) {
    HashTable *hash_table = (HashTable *)malloc(sizeof(HashTable));
    hash_table->size = size;
    hash_table->table = (HashNode **)malloc(size * sizeof(HashNode *));
    for (int i = 0; i < size; i++) {
        hash_table->table[i] = NULL;
    }
    return hash_table;
}

这个函数首先为 HashTable 结构体分配内存，然后为 table 数组分配内存，并将每个元素初始化为 NULL。

哈希表的插入操作

插入操作需要先计算键的哈希值，找到对应的位置，然后将新节点插入到链表中。

void insert(HashTable *hash_table, const char *key, int value) {
    unsigned int hash = hash_function(key, hash_table->size);
    HashNode *node = hash_table->table[hash];
    // 检查是否已存在相同键
    while (node != NULL) {
        if (strcmp(node->key, key) == 0) {
            node->value = value;
            return;
        }
        node = node->next;
    }
    // 如果不存在，创建新节点并插入
    HashNode *new_node = (HashNode *)malloc(sizeof(HashNode));
    new_node->key = strdup(key);
    new_node->value = value;
    new_node->next = hash_table->table[hash];
    hash_table->table[hash] = new_node;
}

该函数首先计算键的哈希值，然后遍历对应的链表，检查是否已存在相同键的节点。如果存在，则更新其值；如果不存在，则创建新节点并插入到链表头部。

哈希表的查找操作

查找操作同样需要计算哈希值，然后在对应的链表中查找键。

int search(HashTable *hash_table, const char *key) {
    unsigned int hash = hash_function(key, hash_table->size);
    HashNode *node = hash_table->table[hash];
    while (node != NULL) {
        if (strcmp(node->key, key) == 0) {
            return node->value;
        }
        node = node->next;
    }
    return -1; // 未找到
}

此函数计算哈希值后，遍历链表查找键。如果找到，则返回对应的值；如果未找到，则返回 -1。

碰撞解决算法

开放定址法

开放定址法（Open Addressing）是一种解决哈希表碰撞的方法。当发生碰撞时，它会在哈希表中寻找下一个可用的位置。

线性探测法 线性探测法（Linear Probing）是最简单的开放定址法。当发生碰撞时，它会线性地检查下一个位置，直到找到一个空位置。例如，假设哈希函数计算出的位置为 index，如果 index 位置已被占用，那么它会检查 index + 1，index + 2，以此类推。

// 线性探测法的哈希表结构体
typedef struct LinearProbingHashTable {
    char **keys;
    int *values;
    int size;
    int count;
} LinearProbingHashTable;

// 初始化线性探测法的哈希表
LinearProbingHashTable* create_linear_probing_hash_table(int size) {
    LinearProbingHashTable *hash_table = (LinearProbingHashTable *)malloc(sizeof(LinearProbingHashTable));
    hash_table->size = size;
    hash_table->count = 0;
    hash_table->keys = (char **)malloc(size * sizeof(char *));
    hash_table->values = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        hash_table->keys[i] = NULL;
    }
    return hash_table;
}

// 线性探测法的插入操作
void linear_probing_insert(LinearProbingHashTable *hash_table, const char *key, int value) {
    if (hash_table->count >= hash_table->size * 0.75) {
        // 负载因子过高，进行扩容
        // 此处省略扩容代码
    }
    unsigned int hash = hash_function(key, hash_table->size);
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            hash_table->values[hash] = value;
            return;
        }
        hash = (hash + 1) % hash_table->size;
    }
    hash_table->keys[hash] = strdup(key);
    hash_table->values[hash] = value;
    hash_table->count++;
}

// 线性探测法的查找操作
int linear_probing_search(LinearProbingHashTable *hash_table, const char *key) {
    unsigned int hash = hash_function(key, hash_table->size);
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            return hash_table->values[hash];
        }
        hash = (hash + 1) % hash_table->size;
    }
    return -1; // 未找到
}

线性探测法的优点是简单直观，实现容易。然而，它容易出现聚集现象，即连续的多个位置被占用，导致查找时间增加。

二次探测法 二次探测法（Quadratic Probing）为了解决线性探测法的聚集问题，它在发生碰撞时，不是线性地查找下一个位置，而是按照二次函数的方式查找。例如，第一次检查 index + 1^2，第二次检查 index + 2^2，第三次检查 index + 3^2 等。

// 二次探测法的哈希表结构体
typedef struct QuadraticProbingHashTable {
    char **keys;
    int *values;
    int size;
    int count;
} QuadraticProbingHashTable;

// 初始化二次探测法的哈希表
QuadraticProbingHashTable* create_quadratic_probing_hash_table(int size) {
    QuadraticProbingHashTable *hash_table = (QuadraticProbingHashTable *)malloc(sizeof(QuadraticProbingHashTable));
    hash_table->size = size;
    hash_table->count = 0;
    hash_table->keys = (char **)malloc(size * sizeof(char *));
    hash_table->values = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        hash_table->keys[i] = NULL;
    }
    return hash_table;
}

// 二次探测法的插入操作
void quadratic_probing_insert(QuadraticProbingHashTable *hash_table, const char *key, int value) {
    if (hash_table->count >= hash_table->size * 0.75) {
        // 负载因子过高，进行扩容
        // 此处省略扩容代码
    }
    unsigned int hash = hash_function(key, hash_table->size);
    int i = 1;
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            hash_table->values[hash] = value;
            return;
        }
        hash = (hash + i * i) % hash_table->size;
        i++;
    }
    hash_table->keys[hash] = strdup(key);
    hash_table->values[hash] = value;
    hash_table->count++;
}

// 二次探测法的查找操作
int quadratic_probing_search(QuadraticProbingHashTable *hash_table, const char *key) {
    unsigned int hash = hash_function(key, hash_table->size);
    int i = 1;
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            return hash_table->values[hash];
        }
        hash = (hash + i * i) % hash_table->size;
        i++;
    }
    return -1; // 未找到
}

二次探测法减少了聚集现象，提高了哈希表的性能。但是，如果哈希表大小不是质数，可能会导致无法探测到所有位置。

双重哈希法 双重哈希法（Double Hashing）使用两个哈希函数。第一个哈希函数用于计算初始位置，当发生碰撞时，使用第二个哈希函数计算一个步长，然后按照这个步长查找下一个位置。例如，h1(key) 计算初始位置，h2(key) 计算步长。

// 双重哈希法的哈希表结构体
typedef struct DoubleHashingHashTable {
    char **keys;
    int *values;
    int size;
    int count;
} DoubleHashingHashTable;

// 第二个哈希函数
unsigned int second_hash_function(const char *str, int table_size) {
    unsigned int hash = 0;
    while (*str) {
        hash = hash * 31 + *str++;
    }
    return 1 + (hash % (table_size - 1));
}

// 初始化双重哈希法的哈希表
DoubleHashingHashTable* create_double_hashing_hash_table(int size) {
    DoubleHashingHashTable *hash_table = (DoubleHashingHashTable *)malloc(sizeof(DoubleHashingHashTable));
    hash_table->size = size;
    hash_table->count = 0;
    hash_table->keys = (char **)malloc(size * sizeof(char *));
    hash_table->values = (int *)malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        hash_table->keys[i] = NULL;
    }
    return hash_table;
}

// 双重哈希法的插入操作
void double_hashing_insert(DoubleHashingHashTable *hash_table, const char *key, int value) {
    if (hash_table->count >= hash_table->size * 0.75) {
        // 负载因子过高，进行扩容
        // 此处省略扩容代码
    }
    unsigned int hash1 = hash_function(key, hash_table->size);
    unsigned int hash2 = second_hash_function(key, hash_table->size);
    unsigned int hash = hash1;
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            hash_table->values[hash] = value;
            return;
        }
        hash = (hash + hash2) % hash_table->size;
    }
    hash_table->keys[hash] = strdup(key);
    hash_table->values[hash] = value;
    hash_table->count++;
}

// 双重哈希法的查找操作
int double_hashing_search(DoubleHashingHashTable *hash_table, const char *key) {
    unsigned int hash1 = hash_function(key, hash_table->size);
    unsigned int hash2 = second_hash_function(key, hash_table->size);
    unsigned int hash = hash1;
    while (hash_table->keys[hash] != NULL) {
        if (strcmp(hash_table->keys[hash], key) == 0) {
            return hash_table->values[hash];
        }
        hash = (hash + hash2) % hash_table->size;
    }
    return -1; // 未找到
}

双重哈希法进一步减少了聚集现象，性能较好，但实现相对复杂，需要精心设计第二个哈希函数。

链地址法

链地址法（Separate Chaining）是另一种常用的碰撞解决方法。在链地址法中，哈希表的每个位置不是存储单个元素，而是存储一个链表。当发生碰撞时，新元素被插入到对应的链表中。我们前面定义的基于链表的哈希表就是链地址法的实现。

链地址法的优点是简单直观，对哈希函数的要求相对较低，而且删除操作也比较容易实现。但是，它需要额外的链表节点内存，并且在链表较长时查找性能会下降。

碰撞解决算法对比

性能对比

时间复杂度
- 链地址法：在理想情况下，哈希函数均匀分布，插入、查找和删除操作的平均时间复杂度为 O(1)。但是，如果哈希函数不好，导致链表过长，最坏情况下时间复杂度会退化为 O(n)，其中 n 是链表的长度。
- 线性探测法：平均情况下，插入和查找的时间复杂度为 O(1)，但在聚集严重时，时间复杂度会接近 O(n)，这里 n 是哈希表中元素的数量。删除操作需要特殊处理，通常标记删除而不是真正删除，以避免影响后续查找，时间复杂度也接近 O(n)。
- 二次探测法：平均情况下性能优于线性探测法，插入和查找的时间复杂度接近 O(1)。但如果哈希表大小不是质数，可能会导致无法遍历所有位置，最坏情况下时间复杂度也会上升。
- 双重哈希法：平均情况下插入、查找和删除的时间复杂度接近 O(1)，它通过两个哈希函数减少聚集，性能相对稳定，最坏情况下时间复杂度也比线性探测法好。
空间复杂度
- 链地址法：除了哈希表本身的空间，还需要为链表节点分配额外的空间，空间复杂度为 O(n)，其中 n 是哈希表中元素的数量。
- 开放定址法：只需要哈希表本身的空间，空间复杂度为 O(m)，其中 m 是哈希表的大小。但是，开放定址法需要预留一定的空间以避免负载因子过高，可能会浪费一些空间。

适用场景对比

链地址法：适用于数据量较大且无法预估具体数量的场景，因为它可以动态地增加链表长度。例如，在实现一个简单的缓存系统时，链地址法可以方便地处理不断变化的缓存数据。
线性探测法：适用于数据量较小且对空间要求较高的场景。由于它不需要额外的链表节点空间，在空间有限的情况下可以优先考虑。例如，在一些嵌入式系统中，线性探测法可能是一个不错的选择。
二次探测法和双重哈希法：适用于对性能要求较高，需要减少聚集现象的场景。例如，在数据库索引等对查找性能要求极高的应用中，二次探测法或双重哈希法可能更合适。

实现复杂度对比

链地址法：实现相对简单，只需要定义链表节点和哈希表结构体，以及相应的插入、查找和删除操作。链表的操作相对直观，容易理解和实现。
线性探测法：实现也比较简单，主要是在插入和查找时进行线性探测。但是，处理删除操作需要一些额外的考虑，以避免影响后续的查找。
二次探测法：比线性探测法稍微复杂一些，需要按照二次函数的方式进行探测。在实现时需要注意哈希表大小的选择，以确保能够遍历所有位置。
双重哈希法：实现最为复杂，需要设计两个哈希函数，并且在插入和查找时使用第二个哈希函数计算步长。对哈希函数的设计要求也较高，需要保证其均匀性和高效性。

总结碰撞解决算法

不同的碰撞解决算法各有优缺点，在实际应用中需要根据具体的需求和场景来选择合适的算法。如果对空间比较敏感，数据量较小，可以考虑开放定址法中的线性探测法；如果对性能要求极高，对空间不太敏感，链地址法或更复杂的开放定址法（如二次探测法、双重哈希法）可能更合适。通过合理选择碰撞解决算法，可以设计出高效、稳定的哈希表，满足各种应用场景的需求。在实际编程中，还需要根据具体情况对哈希表进行优化，如调整哈希表大小、优化哈希函数等，以进一步提高哈希表的性能。