I am currently trying to understand and recreate a database. One common way of storing data in a database is a B Tree. Data is stored only on so-called 'leaf nodes', which are indexed by 'index nodes'. When a key is inserted into a full leaf node, the node is split, with its contents distributed equally between the two new nodes. I understand this happens to keep the tree balanced and easily searchable, but I can't quite follow the procedure.
The example is taken from this thread 'splitting a node in b tree'.
Let's say we have a B Tree like this:
            [21 | 30 | 50]
           /     |    |    \
[10|20|-] => [21|22|25] => [30|40|-] => [50|60|80]
Then we insert the key '23'. We need to split the [21|22|25] node, and to keep the tree balanced we distribute its keys evenly across the two new leaves. The separator key 23 is pushed up into the index node, which now overflows as well:

              [21 | 23 | 30 | 50]
             /     |    |    |    \
[10|20|-] => [21|22|-] => [23|25|-] => [30|40|-] => [50|60|80]

The overflowing index node is split in turn, and its middle key 30 moves up to form a new root:

                        [ 30 ]
                       /      \
            [21 | 23]          [50]
           /     |   \         /   \
[10|20|-] => [21|22|-] => [23|25|-] => [30|40|-] => [50|60|80]
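To make the mechanics concrete, here is a minimal Python sketch of the even leaf split described above. The Leaf class and the leaf layout are assumptions for illustration, not taken from any particular database:

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys  # kept in sorted order

def split_leaf_evenly(leaf, new_key):
    """Insert new_key into a full leaf, then split the keys evenly.

    Returns (left, right, separator): the two new leaves plus the key
    that is pushed up into the parent index node. In a B+ tree style
    layout the separator is copied from the right leaf, not removed.
    """
    keys = sorted(leaf.keys + [new_key])
    mid = len(keys) // 2
    left, right = Leaf(keys[:mid]), Leaf(keys[mid:])
    return left, right, right.keys[0]

# Reproduces the example: inserting 23 into the full [21|22|25] leaf.
left, right, sep = split_leaf_evenly(Leaf([21, 22, 25]), 23)
print(left.keys, right.keys, sep)  # [21, 22] [23, 25] 23
```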
In my understanding of B Trees, this leaves the [21|22|-] leaf unable to be filled to maximum capacity: any key smaller than 21 is entered into the [10|20|-] leaf, any key of 23 or greater belongs in the [23|25|-] leaf or further right, and the keys 21 and 22 are already present.
If we think of leaf nodes as pages, as they are stored in a database, this lets precious memory and disk space go to waste, and more disk operations become necessary to find values, making the database slower.
I would really appreciate feedback on my thoughts and some help clearing up misunderstandings if they are present.
Thank you very much!
CodePudding user response:
This is known as the B-tree "fill factor". When inserting in random order, a fill factor of about 70% is the best that can be achieved in practice. When inserting in order, it drops to 50%, as you have observed. The B*-tree split algorithm performs rebalancing before splitting, and it achieves a fill factor of roughly 66% when entries are inserted in order.
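Here is a hedged sketch of the rebalance-before-split idea, reusing the illustrative Leaf class from the question. A real B*-tree performs a 2-into-3 split when both siblings are full, which is omitted here:

```python
LEAF_CAPACITY = 3  # illustrative assumption

def insert_with_rebalance(leaf, right_sibling, new_key):
    """Try to absorb new_key by shifting keys into a sibling with room,
    so the split is delayed and both leaves stay near-full.

    Returns the updated separator for the parent, or None when both
    nodes are full and the regular split path must be taken instead.
    """
    if len(right_sibling.keys) >= LEAF_CAPACITY:
        return None  # both full: fall back to splitting
    keys = sorted(leaf.keys + right_sibling.keys + [new_key])
    mid = (len(keys) + 1) // 2  # left node keeps the larger half
    leaf.keys, right_sibling.keys = keys[:mid], keys[mid:]
    return right_sibling.keys[0]  # new separator for the parent
```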
B-tree designs can improve the fill factor even further using a few techniques. One is to detect in-order insertion and split the leaf unevenly: the new left node keeps most (or all) of the original entries, while the new right node starts out mostly empty. Such a technique also needs to detect reverse insertion order, to avoid creating an abysmal fill factor in the mirror-image case.
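A sketch of that heuristic under the same assumptions: when the new key lands past the end of the rightmost leaf (in-order insertion), keep the old leaf completely full and start a nearly empty right leaf; otherwise fall back to the even split:

```python
def split_leaf_adaptive(leaf, new_key, is_rightmost_leaf):
    if is_rightmost_leaf and new_key > leaf.keys[-1]:
        # Detected in-order (append) insertion: the left node keeps
        # all original entries at 100% fill, the right starts empty.
        left, right = Leaf(list(leaf.keys)), Leaf([new_key])
    else:
        # Random insertion order: distribute the keys evenly.
        keys = sorted(leaf.keys + [new_key])
        mid = len(keys) // 2
        left, right = Leaf(keys[:mid]), Leaf(keys[mid:])
    return left, right, right.keys[0]
```

A mirror check on the leftmost leaf (new_key smaller than everything in it) would handle reverse insertion order the same way.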
By combining these techniques, a fill factor of 100% is possible with in-order insertion, and about 80% with random insertion order. Going higher than 80% requires more sophisticated rebalancing steps, which might end up being too costly in practice.