Hash Tables Explained: Functions, Performance, and Use Cases

Hash Table (or Hash Map) is a fundamental data structure that implements an associative array abstract data type, mapping keys to values. It uses a hash function to compute an index (called a “hash code”) into an array of buckets or slots, allowing for average O(1) time complexity for insertions, deletions, and lookups. This makes hash tables one of the most efficient data structures for frequent key-value operations, widely used in databases, caches, compilers, and more.

I. Core Principles of Hash Tables

Hash Tables rely on three key components:

  1. Hash Function: Converts a key (e.g., string, integer, object) into an integer index within the bounds of the bucket array. A good hash function should:
    • Distribute keys uniformly across buckets (minimize collisions).
    • Be fast to compute (O(1) time).
    • Avoid bias (no clustering of keys in a small subset of buckets).
  2. Bucket Array: A fixed-size array where each element (bucket) stores a collection of key-value pairs that hash to the same index (to handle collisions).
  3. Collision Resolution: Since multiple keys may hash to the same index (a “collision”), two common strategies are used:
    • Chaining: Each bucket is a linked list (or dynamic array) that stores all key-value pairs with the same hash index.
    • Open Addressing: When a collision occurs, the algorithm probes for the next available bucket (using linear probing, quadratic probing, or double hashing) instead of using a linked list.

Key Terminology

  • Load Factor (α): The ratio of the number of key-value pairs (n) to the number of buckets (m), i.e., α = n/m. A higher load factor (typically > 0.7) increases collision frequency and degrades performance.
  • Rehashing: When the load factor exceeds a threshold, the bucket array is resized (e.g., doubled), and all existing key-value pairs are rehashed into the new array to maintain efficiency.

II. Hash Function Design

A well-designed hash function is critical for hash table performance. Below are examples of hash functions for common key types:

1. Integer Keys

For integer keys, a simple modulo operation works (but ensure the modulus is a prime number to reduce collisions):

python

运行

def hash_integer(key, num_buckets):
    return key % num_buckets  # Modulo prime number for better distribution

2. String Keys

For strings, convert each character to its ASCII value, compute a weighted sum, and apply modulo:

python

运行

def hash_string(key, num_buckets):
    hash_code = 0
    prime = 31  # Small prime to reduce collisions
    for char in key:
        hash_code = (hash_code * prime + ord(char)) % num_buckets
    return hash_code

3. Object Keys

For custom objects, override the hash function (e.g., in Python, __hash__() method) to use object attributes that define equality:

python

运行

class Person:
    def __init__(self, id, name):
        self.id = id
        self.name = name
    
    def __hash__(self):
        # Hash based on unique ID (ensures equal objects have same hash)
        return hash(self.id)
    
    def __eq__(self, other):
        # Define equality to match hash function
        return isinstance(other, Person) and self.id == other.id

III. Collision Resolution Strategies

1. Chaining (Linked List Buckets)

Chaining is the most common collision resolution method. Each bucket contains a linked list (or dynamic array) of key-value pairs that hash to the same index.

Implementation Code (Python)

python

运行

class HashTableChaining:
    def __init__(self, initial_size=10):
        self.size = initial_size  # Number of buckets
        self.buckets = [[] for _ in range(self.size)]  # Each bucket is a list (linked list alternative)
        self.count = 0  # Number of key-value pairs
        self.load_factor_threshold = 0.7  # Trigger rehash when load factor > 0.7

    def _hash(self, key):
        """Private hash function to compute bucket index"""
        if isinstance(key, int):
            return key % self.size
        elif isinstance(key, str):
            hash_code = 0
            prime = 31
            for char in key:
                hash_code = (hash_code * prime + ord(char)) % self.size
            return hash_code
        else:
            # Use built-in hash for custom objects (requires __hash__ and __eq__ methods)
            return hash(key) % self.size

    def _rehash(self):
        """Resize the bucket array and rehash all key-value pairs"""
        old_buckets = self.buckets
        self.size *= 2  # Double the number of buckets
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0

        # Reinsert all key-value pairs into new buckets
        for bucket in old_buckets:
            for key, value in bucket:
                self.put(key, value)

    def put(self, key, value):
        """Insert or update a key-value pair"""
        # Check load factor and rehash if needed
        if self.count / self.size > self.load_factor_threshold:
            self._rehash()

        index = self._hash(key)
        bucket = self.buckets[index]

        # Update value if key already exists
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return

        # Add new key-value pair if not exists
        bucket.append((key, value))
        self.count += 1

    def get(self, key):
        """Retrieve the value for a given key (return None if key not found)"""
        index = self._hash(key)
        bucket = self.buckets[index]

        for k, v in bucket:
            if k == key:
                return v
        return None  # Key not found

    def remove(self, key):
        """Remove a key-value pair (return True if successful, False otherwise)"""
        index = self._hash(key)
        bucket = self.buckets[index]

        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.count -= 1
                return True
        return False  # Key not found

    def __str__(self):
        """String representation of the hash table"""
        result = []
        for i, bucket in enumerate(self.buckets):
            if bucket:
                result.append(f"Bucket {i}: {bucket}")
        return "\n".join(result)

# Test Chaining Hash Table
if __name__ == "__main__":
    ht = HashTableChaining()
    ht.put("apple", 5)
    ht.put("banana", 3)
    ht.put("cherry", 7)
    ht.put("apple", 10)  # Update existing key

    print("Hash Table (Chaining):")
    print(ht)
    # Output example:
    # Bucket 0: [('banana', 3)]
    # Bucket 1: [('apple', 10)]
    # Bucket 2: [('cherry', 7)]

    print("\nGet 'banana':", ht.get("banana"))  # Output: 3
    print("Remove 'cherry':", ht.remove("cherry"))  # Output: True
    print("Get 'cherry':", ht.get("cherry"))  # Output: None

2. Open Addressing (Linear Probing)

Open Addressing stores all key-value pairs directly in the bucket array. When a collision occurs, the algorithm probes consecutive buckets (linear probing: index + 1, index + 2, ...) until an empty bucket is found.

Implementation Code (Python)

python

运行

class HashTableOpenAddressing:
    def __init__(self, initial_size=10):
        self.size = initial_size
        self.keys = [None] * self.size  # Stores keys (None = empty, "DELETED" = tombstone)
        self.values = [None] * self.size
        self.count = 0
        self.load_factor_threshold = 0.7
        self.DELETED = "DELETED"  # Tombstone marker for deleted entries

    def _hash(self, key):
        """Same hash function as chaining"""
        if isinstance(key, int):
            return key % self.size
        elif isinstance(key, str):
            hash_code = 0
            prime = 31
            for char in key:
                hash_code = (hash_code * prime + ord(char)) % self.size
            return hash_code
        else:
            return hash(key) % self.size

    def _probe(self, index):
        """Linear probing: return next available bucket index"""
        for i in range(self.size):
            probe_index = (index + i) % self.size
            if self.keys[probe_index] is None or self.keys[probe_index] == self.DELETED:
                return probe_index
        raise Exception("Hash Table is full (should never happen with rehashing)")

    def _rehash(self):
        """Resize and rehash all entries"""
        old_keys = self.keys
        old_values = self.values
        self.size *= 2
        self.keys = [None] * self.size
        self.values = [None] * self.size
        self.count = 0

        for k, v in zip(old_keys, old_values):
            if k is not None and k != self.DELETED:
                self.put(k, v)

    def put(self, key, value):
        """Insert or update a key-value pair"""
        if self.count / self.size > self.load_factor_threshold:
            self._rehash()

        index = self._hash(key)

        # Find the correct bucket (update if key exists)
        for i in range(self.size):
            probe_index = (index + i) % self.size
            if self.keys[probe_index] == key:
                self.values[probe_index] = value
                return
            if self.keys[probe_index] is None or self.keys[probe_index] == self.DELETED:
                self.keys[probe_index] = key
                self.values[probe_index] = value
                self.count += 1
                return

    def get(self, key):
        """Retrieve value for a key"""
        index = self._hash(key)

        for i in range(self.size):
            probe_index = (index + i) % self.size
            if self.keys[probe_index] is None:
                return None  # Key not found (no more probes needed)
            if self.keys[probe_index] == key:
                return self.values[probe_index]
        return None  # Key not found

    def remove(self, key):
        """Remove a key-value pair (use tombstone to avoid breaking probes)"""
        index = self._hash(key)

        for i in range(self.size):
            probe_index = (index + i) % self.size
            if self.keys[probe_index] is None:
                return False  # Key not found
            if self.keys[probe_index] == key:
                self.keys[probe_index] = self.DELETED  # Mark as deleted
                self.values[probe_index] = None
                self.count -= 1
                return True
        return False

    def __str__(self):
        result = []
        for i in range(self.size):
            if self.keys[i] is not None and self.keys[i] != self.DELETED:
                result.append(f"Index {i}: ({self.keys[i]}, {self.values[i]})")
        return "\n".join(result)

# Test Open Addressing Hash Table
if __name__ == "__main__":
    ht_oa = HashTableOpenAddressing()
    ht_oa.put(10, "ten")
    ht_oa.put(20, "twenty")
    ht_oa.put(30, "thirty")  # Hashes to same index as 10 (10%10=0, 30%10=0) → linear probe to index 2

    print("\nHash Table (Open Addressing):")
    print(ht_oa)
    # Output example:
    # Index 0: (10, 'ten')
    # Index 1: (20, 'twenty')
    # Index 2: (30, 'thirty')

    print("\nGet 30:", ht_oa.get(30))  # Output: 'thirty'
    print("Remove 20:", ht_oa.remove(20))  # Output: True
    print("Get 20:", ht_oa.get(20))  # Output: None

IV. Performance Analysis

MetricChainingOpen Addressing (Linear Probing)Explanation
Average Time Complexity (Insert/Lookup/Delete)O(1 + α)O(1/(1-α))– α = load factor.- Chaining: Average chain length is α.- Open Addressing: Probe sequence length increases with α.
Worst-Case Time ComplexityO(n)O(n)– Chaining: All keys hash to one bucket (chain length n).- Open Addressing: All buckets are filled (probe entire array).
Space ComplexityO(n + m)O(m)– Chaining: Stores n key-value pairs + m buckets.- Open Addressing: Uses fixed m buckets (no extra space for chains).
Collision HandlingRobust (chains grow dynamically)Sensitive to clustering (linear probing causes primary clustering)– Chaining: No limit to chain length.- Open Addressing: Probing can lead to clustered entries, degrading performance.
Load Factor ToleranceCan handle higher α (e.g., 0.9)Better with lower α (e.g., ≤ 0.7)– Open Addressing performance degrades rapidly with high α due to probing.

Key Takeaways

  • Chaining is preferred for most practical applications:
    • Simpler to implement (no tombstones needed for deletions).
    • Robust to high load factors.
    • Works well with dynamic key sets.
  • Open Addressing is useful for memory-constrained environments:
    • No extra space for chains (stores entries directly in the bucket array).
    • Faster access in practice for low load factors (cache-friendly).

V. Practical Applications of Hash Tables

  1. Caching: Storing frequently accessed data (e.g., web page content, database query results) for O(1) lookups (e.g., Redis, Memcached).
  2. Databases: Indexing primary keys and foreign keys to speed up query execution (e.g., hash indexes in MySQL).
  3. Compilers: Symbol tables to track variable names, functions, and their addresses during compilation.
  4. Dictionaries/Hash Maps in Languages: Built-in data structures (e.g., Python dict, Java HashMap, C++ unordered_map) use hash tables.
  5. Networking: Routing tables to map IP addresses to MAC addresses (ARP cache) for fast packet forwarding.
  6. Duplicate Detection: Finding duplicate elements in a dataset (e.g., checking if a username/email already exists).

VI. Common Pitfalls and Best Practices

  1. Poor Hash Function: Leads to uneven key distribution and high collision rates. Use:
    • Prime numbers for modulo operations (reduces clustering).
    • Weighted sums for string keys (e.g., multiplying by a prime like 31).
    • Custom __hash__ and __eq__ methods for objects (ensure equal objects have the same hash).
  2. Ignoring Load Factor: High load factors (α > 0.7) degrade performance. Implement rehashing to resize the bucket array.
  3. Tombstone Management (Open Addressing): Deleted entries must be marked with tombstones to avoid breaking probe sequences for existing keys.
  4. Key Immutability: Keys should be immutable (e.g., integers, strings, tuples in Python) to ensure their hash value doesn’t change after insertion (changing a key’s value would break lookups).

Summary

Best practices: Use a good hash function, manage load factors with rehashing, and ensure keys are immutable.

Hash Table is a key-value data structure that uses a hash function to map keys to bucket indexes, enabling average O(1) time complexity for insertions, lookups, and deletions.

Core components: Hash function, bucket array, and collision resolution (chaining or open addressing).

Chaining is the most common collision resolution method (simple, robust, and cache-friendly for most cases).

Open Addressing is memory-efficient but sensitive to clustering and load factors.

Practical applications include caching, databases, symbol tables, and built-in language dictionaries.



了解 Ruigu Electronic 的更多信息

订阅后即可通过电子邮件收到最新文章。

Posted in

Leave a comment