Union-Find Data Structure Explained: Find and Union Operations

You want to master the Union-Find data structure—also known as Disjoint Set Union (DSU)—a powerful tool for tracking and merging disjoint sets of elements. Its core purpose is to efficiently manage two key operations: find (determine which set an element belongs to) and union (merge two sets), while avoiding redundant work through optimizations like path compression and union by rank/size.

Union-Find is the backbone of algorithms like Kruskal’s (MST), used in network design, cluster analysis, and cycle detection. It shines because its operations are nearly constant time (amortized O(α(V)), where α is the inverse Ackermann function—effectively O(1) for all practical purposes).

Core Concepts

Before diving into implementation, let’s clarify the basics:

Disjoint Sets: Sets with no overlapping elements (e.g., {0,1,2} and {3,4} are disjoint).
Find Operation: Given an element, find the root (representative) of its set. The root is a unique identifier for the set (e.g., if 0 is the root of {0,1,2}, find(2) returns 0).
Union Operation: Merge two sets into one by linking their roots (e.g., merging {0,1,2} and {3,4} creates {0,1,2,3,4} with a single root).
Key Optimizations:
1. Path Compression: Flattens the structure of the tree during find, so future queries are faster.
2. Union by Rank/Size: Ensures the smaller tree is always attached to the root of the larger tree, keeping the tree shallow.

Union-Find Implementation (Python)

We’ll implement Union-Find with both optimizations (path compression + union by rank) for maximum efficiency. This implementation supports numeric elements (easily adaptable to other types with a hash map).

Full Implementation

python

运行

class UnionFind:
    """
    Union-Find (Disjoint Set Union, DSU) data structure with path compression and union by rank.
    Supports numeric elements (0-indexed or arbitrary integers).
    """
    def __init__(self, size):
        """
        Initialize Union-Find with `size` elements (0 to size-1).
        Args:
            size (int): Number of elements to manage (must be ≥ 1).
        """
        if size < 1:
            raise ValueError("Size must be at least 1")
        
        # parent[i] = parent of element i (initially, each element is its own parent)
        self.parent = list(range(size))
        
        # rank[i] = upper bound of the height of the tree rooted at i (for union by rank)
        self.rank = [0] * size
        
        # Optional: Track size of each set (alternative to rank for union by size)
        self.size = [1] * size
    
    def find(self, x):
        """
        Find the root of element x with path compression.
        Args:
            x (int): Element to find (must be between 0 and size-1).
        Returns:
            int: Root of the set containing x.
        """
        # Validate input
        if x < 0 or x >= len(self.parent):
            raise IndexError("Element out of bounds")
        
        # Path compression: Make x's parent point directly to the root
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # Recursively find root and update parent
        return self.parent[x]
    
    def union(self, x, y):
        """
        Merge the sets containing elements x and y using union by rank.
        Args:
            x (int): First element (0 ≤ x < size).
            y (int): Second element (0 ≤ y < size).
        Returns:
            bool: True if sets were merged (x and y were in different sets), False otherwise (already in the same set).
        """
        # Find roots of x and y
        root_x = self.find(x)
        root_y = self.find(y)
        
        # If already in the same set, do nothing
        if root_x == root_y:
            return False
        
        # Union by rank: Attach smaller rank tree to root of larger rank tree
        if self.rank[root_x] < self.rank[root_y]:
            self.parent[root_x] = root_y
            self.size[root_y] += self.size[root_x]  # Update size of root_y's set
        else:
            self.parent[root_y] = root_x
            self.size[root_x] += self.size[root_y]  # Update size of root_x's set
            
            # If ranks are equal, increment rank of the new root
            if self.rank[root_x] == self.rank[root_y]:
                self.rank[root_x] += 1
        
        return True
    
    def connected(self, x, y):
        """
        Check if elements x and y are in the same set.
        Args:
            x (int): First element.
            y (int): Second element.
        Returns:
            bool: True if x and y are in the same set, False otherwise.
        """
        return self.find(x) == self.find(y)
    
    def get_set_size(self, x):
        """
        Get the size of the set containing element x.
        Args:
            x (int): Element to query.
        Returns:
            int: Size of the set.
        """
        root = self.find(x)
        return self.size[root]
    
    def get_num_sets(self):
        """
        Get the number of disjoint sets currently managed.
        Returns:
            int: Number of disjoint sets.
        """
        # Count elements that are their own parent (roots)
        return sum(1 for i in range(len(self.parent)) if self.parent[i] == i)

# Test the Union-Find implementation
if __name__ == "__main__":
    # Initialize Union-Find with 6 elements (0-5)
    uf = UnionFind(6)
    print("Initial number of sets:", uf.get_num_sets())  # Output: 6
    
    # Merge sets
    uf.union(0, 1)
    uf.union(1, 2)
    uf.union(3, 4)
    print("\nAfter merging (0-1-2) and (3-4):")
    print("Number of sets:", uf.get_num_sets())  # Output: 3 (sets: {0,1,2}, {3,4}, {5})
    print("0 and 2 are connected?", uf.connected(0, 2))  # Output: True
    print("2 and 3 are connected?", uf.connected(2, 3))  # Output: False
    print("Size of set containing 1:", uf.get_set_size(1))  # Output: 3
    
    # Merge more sets
    uf.union(2, 3)
    uf.union(4, 5)
    print("\nAfter merging (0-1-2-3-4) and (5):")
    print("Number of sets:", uf.get_num_sets())  # Output: 1 (all elements connected)
    print("0 and 5 are connected?", uf.connected(0, 5))  # Output: True
    print("Size of set containing 5:", uf.get_set_size(5))  # Output: 6

Sample Output

plaintext

Initial number of sets: 6

After merging (0-1-2) and (3-4):
Number of sets: 3
0 and 2 are connected? True
2 and 3 are connected? False
Size of set containing 1: 3

After merging (0-1-2-3-4) and (5):
Number of sets: 1
0 and 5 are connected? True
Size of set containing 5: 6

How the Optimizations Work

1. Path Compression (in `find` method)

Without path compression, the tree can become deep (e.g., 0 → 1 → 2 → 3 → 4), making find(4) require 4 steps. With path compression:

When find(4) is called, it recursively finds the root (0) and updates all nodes along the path to point directly to 0.
Subsequent calls to find(2) or find(4) will take only 1 step (direct access to root).

2. Union by Rank (in `union` method)

Without union by rank, merging a tall tree into a short tree can create a deep tree. Union by rank ensures:

The shorter tree is always attached to the root of the taller tree.
If trees are the same height, the new root’s rank is incremented (to track the maximum possible height).
This keeps the tree height logarithmic in the number of elements, ensuring fast find operations.

Time and Space Complexity

Operation	Amortized Time Complexity	Explanation
`find(x)`	O(α(V))	α(V) is the inverse Ackermann function—grows extremely slowly (α(V) < 5 for V < 10^600). Effectively O(1) in practice.
`union(x, y)`	O(α(V))	Depends on two `find` operations (each O(α(V))) plus constant-time root linking.
`connected(x,y)`	O(α(V))	Two `find` operations.
`get_set_size(x)`	O(α(V))	One `find` operation plus constant-time size lookup.

Space Complexity	Value	Explanation
Overall	O(V)	Stores `parent`, `rank`, and `size` arrays of size V (number of elements).

Pros and Cons of Union-Find

Pros

Near-Constant Time Operations: Amortized O(α(V)) time for all core operations—faster than most other data structures for set management.
Simple to Implement: Core logic is straightforward, with optimizations adding minimal complexity.
Memory Efficient: Uses O(V) space, which is optimal for tracking V elements.
Versatile: Powers critical algorithms (Kruskal’s, cycle detection) and works for diverse use cases (clustering, network connectivity).

Cons

Limited to Numeric Elements (by Default): To handle non-numeric elements (e.g., strings, objects), you need a hash map to map elements to numeric indices.
No Split Operation: Union-Find does not support splitting a set into two disjoint sets efficiently (this would require a more complex data structure like a link-cut tree).
Requires Fixed Initial Size: The initial size of the Union-Find structure is fixed (though you can extend it with dynamic resizing if needed).

Real-World Applications of Union-Find

Kruskal’s Algorithm (MST): Uses Union-Find to avoid cycles while adding edges to the MST (critical for network design with minimum cost).
Cycle Detection: In undirected graphs—for each edge (u, v), if find(u) == find(v), a cycle exists.
Network Connectivity: Tracking which nodes in a network are connected (e.g., internet routers, social networks).
Cluster Analysis: Grouping similar data points (e.g., image segmentation, customer segmentation).
Dynamic Connectivity: Answering connectivity queries in real-time (e.g., “Is node A connected to node B after adding edge C?”).
Games: Managing player teams or territories (e.g., in strategy games where players merge regions).