How can I reason about the performance effects of changing to a more compact representation for 48-bit keys?


Similar to this question: What's the actual effect of successful unaligned accesses on x86?

We have an application which searches through a massive array of structures for which the key is a 48-bit integer. We have always used a 64-bit type to represent the 48-bit integer, on the assumption that space is cheap and alignment improves performance.

The space used by the array is huge. So I am curious as to what effect switching to a compact representation would actually have.

This is a performance question, so the essential answer is: measure. The question is what we can hypothesize in advance of measuring.

Effectively changing:

struct Entry
{
   uint64_t key;
   int32_t value1;
   int32_t value2;   
};
std::vector<Entry> foobar;

to say:

struct uint48
{
  unsigned long long v:48;
} __attribute__((packed));

struct Entry
{
   uint48 key;
   int32_t value1;
   int32_t value2;   
} __attribute__((packed));
std::vector<Entry> snafu;

Making this change in practice would save a fair bit of memory. I am curious what effect I could expect it to have on performance in practice.

The only way to know for sure is of course to measure but this would require a not insignificant amount of effort.

What can I reason about the expected effects?


This is my starting point:

  • Searching through the structure consists of a number of operations based on the 48-bit key (mostly Hamming distance rather than equality).
  • The structure remains read-only throughout.
  • There are heuristics which mean that successive searches should use local data and hopefully exploit CPU caching.
  • Assume an x86_64 based architecture.

On the one hand we have:

  • An Entry is currently 128 bits = 16 bytes wide.

    Reducing it to 14 bytes will remove any 16-byte alignment benefit, if such a thing exists.

  • Extracting a 48-bit integer becomes more expensive.

    i.e. uint64_t foo = someEntry.key; (see the sketch below)
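
For a concrete picture, here is roughly the work each representation implies (a minimal sketch, not our actual code; on x86-64, GCC and Clang typically turn the packed access into a 4-byte plus a 2-byte load combined with shifts):

#include <cstdint>
#include <cstring>

// Plain 64-bit member: a single aligned 8-byte load.
uint64_t load_plain(const uint64_t *p)
{
    return *p;
}

// Packed 48-bit field: the value must be assembled from a possibly
// unaligned, narrower access, roughly equivalent to this memcpy.
uint64_t load_packed48(const unsigned char *p)
{
    uint64_t v = 0;
    std::memcpy(&v, p, 6);   // 6-byte unaligned load; little-endian assumed
    return v;
}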

On the other hand:

  • Reducing the size of each entry increases the chance of wanted data being in a cache.

    The data set overall is way too big to completely fit in any cache.

The cost of reducing the memory use is likely a reduction in performance, but is there a way to estimate it that would justify performing the actual experiment?

For example, if we expect a 50% loss of performance, it's probably not worth even trying. If, on the other hand, it's likely to be more like 5%, it may be worth doing.
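
One rough back-of-envelope (my own, and very much a guess): shrinking an entry from 16 to 14 bytes moves 12.5% less data, so for a purely bandwidth-bound linear scan the best case is roughly a 12% gain rather than a loss. On the other hand, a 64-byte cache line holds exactly four 16-byte entries, while 14-byte entries no longer tile lines: over a 448-byte window (seven lines, 32 entries) six entries straddle a line boundary, so a scattered lookup sometimes touches two lines for one entry. Which effect dominates depends on how streaming versus scattered the accesses really are, which is exactly what the experiment would have to decide.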

I hope to perform the experiment in the future regardless but it is likely to be quite a distant future due to other commitments.

I think this is an interesting question in terms of how you should approach thinking about such problems.

CodePudding user response:

Your understanding of the C and C++ bit-field feature is lacking

Your struct uint48 is saying "Allocate me an unsigned long long. Now place v inside it using 48 bits, leaving 16 bits of padding."

In standard C and C++, a group of bit-field members is padded out to the size of its underlying type, so portably sizeof(uint48) >= sizeof(unsigned long long).

You only get portable memory savings when a structure has multiple adjacent non-static bit-fields with the same underlying type, and more than one adjacent field fits within that type.

So any memory saving from your change comes entirely from the non-standard __attribute__((packed)) extension; without it, the struct would be no smaller than before.
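
You can check what your compiler actually does in a few lines (a minimal sketch; the 8-and-6 output is what GCC and Clang produce on x86-64, other compilers may differ):

#include <cstdio>

struct uint48_plain  { unsigned long long v : 48; };
struct uint48_packed { unsigned long long v : 48; } __attribute__((packed));

int main()
{
    // Typical GCC/Clang x86-64 output is "8 6": without packed the
    // bit-field group is padded to the size of its underlying type;
    // packed is the non-standard extension that removes the padding.
    std::printf("%zu %zu\n", sizeof(uint48_plain), sizeof(uint48_packed));
}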

To make search performance better, use a better search algorithm

You mention that you are searching for inexact matches based on Hamming distance. You need to sort your data and implement a lookup that takes advantage of that sorting to perform pruning.

If you are looking for matches within a Hamming distance of 4, and there are already 6 bits different in the first 16 bits, you can skip the entire block: no matter what values appear in the remaining 32 bits, you will never get back down to a total distance of 4. So you can prune that entire subspace and not search through it.

A radix tree is likely going to be a suitable structure (maybe not optimal but a huge improvement over what you have) because the prefix distance provides a lower bound for the distance deeper in that branch of the tree allowing you to skip (prune) large sections of the tree.

This sort of data structure subsumes @krlmlr's excellent point about moving the associated values away from the key list, because in a radix tree the key is actually implicit in the tree structure, and you never reach the last level where the leaf values are stored (bringing them into cache) until you have already processed the key and verified that your criteria are met.

If you don't know the allowed mismatch weight in advance, you can still use branch-and-bound techniques: track the best distance found so far and prune any branch whose prefix distance already exceeds it.
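
Here is a minimal sketch of that pruning idea, assuming the keys are sorted in a plain array so that the array itself acts as an implicit radix tree (all identifiers are illustrative, not taken from your code):

#include <cstddef>
#include <cstdint>
#include <vector>
#include <algorithm>

// Branch-and-bound Hamming search over a sorted array of 48-bit keys.
// At depth d the range [lo, hi) shares its top d bits and splits into
// a "current bit = 0" half followed by a "current bit = 1" half.  If
// the shared prefix already differs from the query in more than
// maxDist bits, the whole subtree is pruned.
void search(const std::vector<uint64_t> &keys, uint64_t query, int maxDist,
            std::size_t lo, std::size_t hi, int bit, int prefixDist,
            std::vector<std::size_t> &out)
{
    if (lo >= hi || prefixDist > maxDist)
        return;                               // prune: lower bound already too big
    if (bit < 0) {                            // all 48 bits checked: every key matches
        for (std::size_t i = lo; i < hi; ++i)
            out.push_back(i);
        return;
    }
    const uint64_t mask = uint64_t(1) << bit;
    // First key in [lo, hi) whose current bit is 1 (valid because keys are sorted).
    const std::size_t mid = static_cast<std::size_t>(
        std::partition_point(keys.begin() + lo, keys.begin() + hi,
                             [mask](uint64_t k) { return (k & mask) == 0; })
        - keys.begin());
    const int qbit = (query & mask) ? 1 : 0;
    // The half whose bit matches the query keeps prefixDist; the other
    // half costs one more bit of Hamming distance.
    search(keys, query, maxDist, lo, mid, bit - 1, prefixDist + qbit, out);
    search(keys, query, maxDist, mid, hi, bit - 1, prefixDist + (1 - qbit), out);
}

// Usage: keys sorted ascending, 48-bit keys start at bit 47:
//   search(keys, q, 4, 0, keys.size(), 47, 0, out);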

CodePudding user response:

struct uint48
{
  unsigned long long v:48;
} __attribute__((packed));

Do not do that. It tells the compiler the object might be merely byte-aligned, so it generally has to use unaligned loads to access it. You would likely be better off with three uint16_t, which the compiler can at least assume are two-byte aligned, or by telling the compiler the struct is two-byte aligned instead of fully packed. (In some situations, depending on cache patterns and other factors, it might even be beneficial to keep one array of uint32_t and another array of uint16_t, and then the compiler can load each with one instruction appropriate for that type.)
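
For illustration only, a sketch of both layouts, assuming GCC/Clang attribute syntax and a little-endian x86-64 target (all names are made up):

#include <cstddef>
#include <cstdint>
#include <vector>

// Variant 1: three uint16_t fields, two-byte aligned, 14 bytes total.
struct CompactEntry
{
    uint16_t key0, key1, key2;   // bits 0-15, 16-31, 32-47 of the key
    int32_t  value1;
    int32_t  value2;
} __attribute__((packed, aligned(2)));

static_assert(sizeof(CompactEntry) == 14, "expected 14-byte layout");

inline uint64_t key(const CompactEntry &e)
{
    return  (uint64_t)e.key0
         | ((uint64_t)e.key1 << 16)
         | ((uint64_t)e.key2 << 32);
}

// Variant 2: the split-array idea from the parenthetical above; each
// part of the key loads with a single instruction of its natural width.
struct SplitKeys
{
    std::vector<uint32_t> lo;   // key bits 0-31
    std::vector<uint16_t> hi;   // key bits 32-47
};

inline uint64_t key(const SplitKeys &k, std::size_t i)
{
    return (uint64_t)k.lo[i] | ((uint64_t)k.hi[i] << 32);
}

In the split-array layout every load is naturally aligned for its own type, at the cost of touching two arrays per key.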

Assume an x86_64 based architecture.

That is inadequate to gauge performance. There are a lot of different processor models in that architecture, and they can be configured differently in different systems, with different operating systems applying different settings to them.

Reducing it to 14 bytes will remove any 16-byte alignment benefit, if such a thing exists.

There are definitely 16-byte alignment benefits on some x86-64 processors. It is unknown, from the information in your question, whether you are using any of them. If you have diverse fields in a 16-byte structure, I doubt there is much benefit. (Uniform fields possibly could benefit from various SIMD features, like SSE2, AVX, and so on.)

Reducing the size of each entry increases the chance of wanted data being in a cache.

Cache behavior can often be improved more by careful consideration of use patterns and data layout, and by algorithm redesign based on those considerations, than by slight decreases in the size of the data used.

The cost of reducing the memory use is likely a reduction in performance, but is there a way to estimate it that would justify performing the actual experiment?

Yes, if you have sufficient information about the data access patterns of your algorithm, the hardware you are using, and so on. No such information is present in the posted question.
