Reddit · 2026 Q2
Small to Large Merging Notes [Advanced]
Here are some notes on small-to-large merging. Written by me (a human), not AI. Please note this is a more advanced topic and **not for interviews**, but it does appear on LeetCode.

**1: Naive version**

Say we have a DFS function and we call it on every node in the tree. At each node, we iterate over every pair of nodes drawn from two *different* child subtrees, i.e. every pair whose LCA is the current node. It looks like `O(n^3)` because we do up to `n^2` work at each of `n` nodes, but it is actually `O(n^2)`, because every pair of nodes gets considered exactly once: at its LCA.

```python
def dfs(node):
    for child1, child2 in pairs(children[node]):  # pairs of distinct children
        for a in subtree[child1]:
            for b in subtree[child2]:
                # do something with the pair (a, b); its LCA is `node`
```

**2: Speedup with small to large merging**

Now instead of looping over pairs, we just loop over children. Each child returns some data of size `c`, where `c` is the size of that child's subtree.

```python
def dfs(node):
    accumulator = []
    for child in children[node]:
        childData = dfs(child)  # its size is the size of that child's subtree
        if len(accumulator) < len(childData):
            # crucial small-to-large swap to get O(n log n)
            accumulator, childData = childData, accumulator
        for val in childData:
            ...  # do work (query against the accumulator)
        for val in childData:
            ...  # then safely update the accumulator
    return accumulator
```

Proof of the `O(n log n)` time complexity: consider any element `e`. Work is only ever done on elements of the smaller bucket. Every time `e` is in the smaller bucket, that bucket gets merged into a bucket at least as large, so the size of the container holding `e` at least doubles. The container size can double at most `log n` times, so `e` is touched at most `log n` times.

**3: Simpler version of small to large merging that is O(n log n)**

I have found that instead of swapping `accumulator` and `childData`, we can just pick the heaviest child as the root container and merge everything else in. If we initialize the accumulator with the largest child's data, every other child's bucket is no larger than the accumulator, so whenever a light bucket is merged away, its elements land in a container at least as large as the bucket. The previous doubling argument then holds.
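To make the swap-based merging of section 2 concrete, here is a self-contained runnable sketch, my own illustrative example rather than anything from the post: counting the distinct values in every subtree. The function name and the rooted-at-node-0 convention are assumptions; it is written iteratively so deep trees don't hit Python's recursion limit.

```python
from collections import defaultdict

def distinct_values_per_subtree(n, edges, values):
    """For each node, count the distinct values in its subtree using
    small-to-large set merging (illustrative sketch, rooted at node 0)."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    # BFS from node 0 to orient the tree; `order` grows as we iterate it
    children = defaultdict(list)
    order, seen = [0], {0}
    for u in order:
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                children[u].append(v)
                order.append(v)

    answer = [0] * n
    bucket = {}
    for u in reversed(order):  # children are processed before parents
        acc = {values[u]}
        for c in children[u]:
            child_data = bucket.pop(c)
            # crucial small-to-large swap: keep the larger set as accumulator
            if len(acc) < len(child_data):
                acc, child_data = child_data, acc
            acc |= child_data  # merge the smaller set into the larger one
        bucket[u] = acc
        answer[u] = len(acc)
    return answer
```

Each value lands in the smaller set at most `log n` times, so this is `O(n log n)` set insertions total.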
```python
def dfs(node):
    # a bunch of buckets, each the size of that child's subtree
    childrenData = [dfs(child) for child in children[node]]
    childrenData.sort(key=len, reverse=True)
    heavyChild = childrenData[0]
    for i in range(1, len(childrenData)):
        ...  # merge childrenData[i] into heavyChild, as in section 2
    return heavyChild
```

**4: Traps**

It is not safe to do `O(|heavyChild|)` work at each node, like this:

```python
def dfs(node):
    childrenData = [dfs(child) for child in children[node]]
    childrenData.sort(key=len, reverse=True)
    heavyChild = childrenData[0]
    newDataStructure = fn(heavyChild)  # takes O(|heavyChild|) work, NOT SAFE
```

Imagine a stick (path) tree: we would do `1 + 2 + 3 + ... + N = O(n^2)` work. Example bad submission (that somehow still passed): [https://leetcode.com/problems/maximum-xor-of-two-non-overlapping-subtrees/submissions/1967670898/](https://leetcode.com/problems/maximum-xor-of-two-non-overlapping-subtrees/submissions/1967670898/)

The fix is to re-use structures:

```python
def dfs(node):
    childrenData = [dfs(child) for child in children[node]]
    # each entry is a (trie, values) pair, so sort by the number of values
    childrenData.sort(key=lambda data: len(data[1]), reverse=True)
    # re-use our heaviest child's structures
    heavyChildBinaryTrie, heavyChildValues = childrenData[0]
    for i in range(1, len(childrenData)):
        lightChildBinaryTrie, lightChildValues = childrenData[i]
        # loop over each light child's values and update heavyChildBinaryTrie;
        # lightChildBinaryTrie gets thrown away
        for v in lightChildValues:
            ...  # update the result here
        for v in lightChildValues:
            ...  # update the accumulator (separate step, so one pass
                 # doesn't pollute the accumulator mid-query)
        for v in lightChildValues:
            heavyChildValues.append(v)  # extend the element list
            # (these appends can also be done in the previous loop)
    return heavyChildBinaryTrie, heavyChildValues
```

We also cannot do something like `allValues = [val for childData in childrenData for val in childData]`, because this loops over the heavy values too.

Golden rule: we cannot do heavy work inside a node. Instead, just append the light values to the heavy values at the end, as in the code above.
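Sections 3 and 4 together can be sketched as a complete program, again my own illustrative example with hypothetical names, not the author's code: for each node, find the highest multiplicity of any single value in its subtree, re-using the heaviest child's `Counter` as the accumulator so only light-bucket work happens per node.

```python
from collections import Counter

def subtree_mode_freq(values, children, root=0):
    """For each node, the highest multiplicity of any single value in its
    subtree. Illustrative sketch: `children[u]` lists u's children; sorting
    is by subtree size, so each value is iterated O(log n) times total.
    Uses recursion; a very deep tree would need an iterative rewrite."""
    n = len(values)
    answer = [0] * n

    def dfs(u):
        # each child returns (its Counter, its subtree size)
        buckets = [dfs(c) for c in children[u]]
        # heaviest subtree first; each bucket participates in one sort only
        buckets.sort(key=lambda b: b[1], reverse=True)
        size = 1
        if buckets:
            acc = buckets[0][0]  # re-use the heaviest child's Counter
            size += buckets[0][1]
        else:
            acc = Counter()
        acc[values[u]] += 1
        for light, light_size in buckets[1:]:
            size += light_size
            for val, cnt in light.items():  # O(light subtree) work only
                acc[val] += cnt             # the light Counter is discarded
        answer[u] = max(acc.values())
        return acc, size

    dfs(root)
    return answer
```

Note the sort key is the subtree size carried alongside each `Counter`, not `len` of the `Counter` itself: merging costs are bounded by light subtree sizes, which is what the doubling argument charges against.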
**5: Sorting is safe**

Note that we can safely sort the children inside a node, and it doesn't break the `O(n log n)` invariant:

```python
def dfs(node):
    childrenData = [dfs(child) for child in children[node]]
    # safe: every bucket participates in exactly one sort,
    # so the total sorting work across the whole tree is O(n log n)
    childrenData.sort(key=len, reverse=True)
```

If anything, sorting two lists of size `n/2` is faster than a single sort on `n`, so this is fine performance-wise. But it isn't necessary: we could locate the heavy child with a single linear pass over the children anyway.

**6: Separating the accumulators from the data we send up**

Note that accumulators can use a different representation than the actual values we send up. For instance, if we want the max XOR of any two non-overlapping subtree sums, we can send up lists of subtree sums, and use binary tries as the accumulators.

**7: Piecing it together**

Here's a sample solution combining all of the above concepts. It is `O(n * B * log n)`, where `B` is the number of bits: [https://leetcode.com/problems/maximum-xor-of-two-non-overlapping-subtrees/submissions/1967703802/](https://leetcode.com/problems/maximum-xor-of-two-non-overlapping-subtrees/submissions/1967703802/)
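Here is a condensed sketch of how sections 2, 4, and 6 combine on that problem (maximum XOR of the subtree sums of two nodes where neither subtree contains the other). `BitTrie` and the function names are mine, not taken from the linked submission; the sketch assumes sums fit in 48 bits and a recursion-friendly tree depth.

```python
from collections import defaultdict

class BitTrie:
    """Minimal binary trie over fixed-width integers (illustrative helper)."""
    def __init__(self, bits=48):
        self.bits = bits
        self.root = {}

    def insert(self, x):
        node = self.root
        for b in range(self.bits - 1, -1, -1):
            node = node.setdefault((x >> b) & 1, {})

    def max_xor(self, x):
        if not self.root:
            return 0  # empty trie: nothing to pair with
        node, out = self.root, 0
        for b in range(self.bits - 1, -1, -1):
            want = 1 - ((x >> b) & 1)
            if want in node:
                out |= 1 << b
                node = node[want]
            else:
                node = node[(x >> b) & 1]
        return out

def max_xor_non_overlapping_subtrees(n, edges, values):
    """Small-to-large over (trie, list-of-subtree-sums) pairs, rooted at 0.
    Returns 0 when no two non-overlapping subtrees exist."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    best = 0

    def dfs(u, parent):
        nonlocal best
        subtotal = values[u]
        buckets = []
        for v in adj[u]:
            if v == parent:
                continue
            (trie, sums), child_sum = dfs(v, u)
            subtotal += child_sum
            buckets.append((trie, sums))
        # heaviest child first; len(sums) equals that child's subtree size
        buckets.sort(key=lambda ts: len(ts[1]), reverse=True)
        trie, sums = buckets[0] if buckets else (BitTrie(), [])
        for light_trie, light_sums in buckets[1:]:
            for s in light_sums:  # query first: these pairs come from
                best = max(best, trie.max_xor(s))  # different child subtrees
            for s in light_sums:  # then merge into the accumulator trie
                trie.insert(s)
            sums.extend(light_sums)  # the light trie is discarded
        # our own sum overlaps every descendant, so only insert it;
        # it gets queried higher up, against sums from sibling subtrees
        trie.insert(subtotal)
        sums.append(subtotal)
        return (trie, sums), subtotal

    dfs(0, -1)
    return best
```

The querying pass runs before the merging pass at each node, which is exactly the two-loop discipline from section 2, and only light values are ever iterated, which is the golden rule from section 4.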
Topics
Trees
Graphs
Sorting
Bit Manipulation