Home > Software engineering >  Should B Trees with variable length keys and values be split based on size (bytes)?
Should B Trees with variable length keys and values be split based on size (bytes)?

Time:11-03

I am trying to learn and implement a B Tree for physical storage that supports variable length keys and values. It is to my understanding that you want to reduce the number of disk reads, so it makes sense to me to split a node when it's size (in bytes) overflows the page size (or the optimal read write size). However, is this still a good practice when only a single key can fit into a node, or is it better to start utilizing overflow pages/pointers when keys reach larger sizes to increase fanout? Do nodes need a minimum number of children for better performance?

CodePudding user response:

With a B tree designed for fixed-sized pages you want to limit the key size such that at least two of them will fit in a node. Or in other words, the maximum key size before using overflow pages should be less than half the size of the fixed-size page. Here's why:

Imagine that all keys are larger than half the page size, and that they're incompressible. Start with a B Tree with one node and one key. Now insert a new key. That one node needs to immediately split, and a new root node is created. As result, the B tree consists of three nodes, all of them full. Now insert another key. This causes a leaf node split, and the parent node splits too, and a new root node is created.

Do you see where this is going? The height of the B Tree will match the width, and so it's not a terribly effective tree at all. Searching for one key out of one million requires examining one million nodes.

  • Related