Given n (n > 1) distinct numbers (1, 2, 3, ..., n for example) in random order, insert them to an empty BST successively. The expected number of its leaves is (n 1) / 3.
I tried to use method of mathematical induction: f[n] is the expected number of leaves in a BST build with n distinct numbers. What I thaught about is which place the nth number is interted to (it will be a leaf node). If its father node used not to be a leaf, then the number of leaves would increase by one. So I calculated the possibility of it being such a node: (n / 3 1) / (2n / 3 1) (equation set n0 = n2 1, n0 n1 n2 = n and n0 = n / 3 where ni mean the number of node with i children helps to calculate the number of each kind of node) but it's obviously not 1 / 3 so f[n 1] does not equal to f[n] 1 / 3.
Could someone explain it? Thanks!
CodePudding user response:
Your proof is wrong, because it assumes some sort of "average" tree with exactly the right number of leaves.
Here's a valid proof. Maybe there's something shorter, but this looks right to me.
We want to prove by induction that the average number of leaves L[n] is 0 for n=0, 1 for n=1 and (n 1)/3 for n>1.
The base cases are easy to check: L[0] = 0, L[1] = 1, L[2] = 1.
Suppose (for induction) it's true for all tree sizes less than n. That is, L[m] = (m 1)/3 for all m<n.
If you're shuffling (1...n) and then inserting them into a BST, then the root of the tree is the first item in the permutation (call it k), there's (k-1) things off the left child, and (n-k) things off the right child. Each of those subtrees satisfies the induction hypothesis -- it's a BST formed by a random permutation of some set of distinct numbers. By linearity of expectation, and because each k=1..n is equally likely, L[n] = sum(L[k-1] L[n-k], k=1..n)/n.
Now, L[n] = sum(L[k-1] L[n-k], k=1..n)/n = ((L[0] L[n-1]) (L[1] L[n-2]) sum(L[k-1] L[n-k], k=3..n-2) (L[n-2] L[1]) (L[n-1] L[0]))/n -- this is just separating out the k=1, 2, n-1 and n cases from the sum.
And L[0] L[n-1] L[1] L[n-2] = n/3 1 (n-1)/3 (by induction) = 2/3 2n/3
And sum(L[k-1] L[n-k], k=3..n-2) = sum(k/3 (n-k 1)/3, k=3..n-2) = (n 1)(n-4)/3 (by induction)
So the average number of leaves is (4/3 4n/3 (n 1)(n-4)/3) / n = (n 1)/3.
CodePudding user response:
After inserting n nodes, we expect f(n) leaves. Obviously we have f(1)=1.
Of the n 1 places that a new node could be inserted, we therefore expect that 2f(n) of those are under leaves. We expect non-leaf nulls to number n 1-2f(n)
Inserting under a leaf will not change the number of leaves, but inserting under a non-leaf will increase the leaf count by one, so we have:
f(n 1) = f(n) (n 1-2f(n))/(n 1)
or
f(n) = f(n-1) (n-2f(n-1))/n = 1 f(n-1) - 2f(n-1)/n
Now it's easy to prove by induction that (n 1)/3 works for all n. If f(n-1) = n/3, then:
f(n) = 1 n/3 - 2n/3n = (n 1)/3