I'm familiar with using a column-store vs a row-store for how a database internally persists data to disk. My question is: when a dataset is entirely in memory and there's no storage to disk, does the row- vs column-orientation make much of a difference?
The things I can think of that may make a difference would be:
- For fields under 8 bytes, it would involve fewer memory accesses for columns than for rows.
- Compression would also be easier on a column-store regardless of whether in memory or not (seems like a non-issue if not saving back to storage I suppose? does compression ever matter on in-memory operations?)
- Possible to vectorize operations.
- Much, much easier to work with a `struct` on a row-by-row basis, of course.
Are those points accurate, and are there any more? Given this, would there be a substantial performance improvement from using an in-memory column-store vs row-store on a read-only dataset, or just a marginal one?
CodePudding user response:
> I'm familiar with using a column-store vs a row-store for how a database internally persists data to disk. My question is: when a dataset is entirely in memory and there's no storage to disk, does the row- vs column-orientation make much of a difference?
A lot depends on the size of the dataset, what the contents of each row are, how you need to search in it, whether you want to add items to or remove items from the dataset, and so on.
There is also the CPU and memory architecture to consider; how big are your caches, what is the size of a cache line, and how intelligent is your CPU's prefetcher.
> For fields under 8 bytes, it would involve fewer memory accesses for columns than for rows.
Memory is not accessed a register at a time, but rather a cache line at a time. On most contemporary machines, cache lines are 64 bytes. So what matters is how many cache lines a scan has to touch, not how many bytes the field itself occupies: a small field packed into its own column fills each cache line with useful values, whereas the same field scattered across wide rows drags the rest of every row into the cache with it.
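To make that concrete, here is a minimal sketch (the struct, field names, and sizes are made up for illustration): summing one 4-byte field reads every cache line of an array-of-structs table, but only the lines holding that one column in a struct-of-arrays layout.

```cpp
#include <cstdint>
#include <vector>

// Row-by-row (array of structs): each row is 32 bytes, so a 64-byte
// cache line holds only 2 rows. Scanning just `price` still pulls
// every cache line of the whole table into the cache.
struct RowAoS {
    std::int64_t id;
    std::int32_t price;
    std::int32_t quantity;
    double       weight;
    double       discount;
};

// Column-by-column (struct of arrays): `price` values are packed
// 16 per cache line, so the same scan touches ~8x fewer lines.
struct TableSoA {
    std::vector<std::int64_t> id;
    std::vector<std::int32_t> price;
    std::vector<std::int32_t> quantity;
    std::vector<double>       weight;
    std::vector<double>       discount;
};

std::int64_t total_price_aos(const std::vector<RowAoS>& rows) {
    std::int64_t sum = 0;
    for (const auto& r : rows) sum += r.price;  // strided access
    return sum;
}

std::int64_t total_price_soa(const TableSoA& t) {
    std::int64_t sum = 0;
    for (std::int32_t p : t.price) sum += p;    // contiguous access
    return sum;
}
```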
> Compression would also be easier on a column-store regardless of whether in memory or not
Not really. You can compress/decompress a column even if it is not stored in memory consecutively. It might be faster though.
> does compression ever matter on in-memory operations?
That depends. If it's in-memory, then it's likely that compression will reduce performance, but on the other hand, the amount of data that you need to store is smaller, so you will be able to fit more into memory.
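As a small illustration of why a column compresses so well and can help more of the dataset fit in memory (run-length encoding here is just one scheme, and the names are made up):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Run-length encode one column. A sorted or low-cardinality column
// (say, a country-code column) collapses into a handful of runs,
// so more of the dataset fits in memory -- at the cost of decoding
// work when you read it.
std::vector<std::pair<std::int32_t, std::uint32_t>>
rle_encode(const std::vector<std::int32_t>& column) {
    std::vector<std::pair<std::int32_t, std::uint32_t>> runs;
    for (std::int32_t value : column) {
        if (!runs.empty() && runs.back().first == value) {
            ++runs.back().second;
        } else {
            runs.emplace_back(value, 1u);
        }
    }
    return runs;
}
```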
> Possible to vectorize operations.
You can vectorize with either layout; it's only the loads and stores that might be slower when data is grouped by rows, because the values of one column are then strided in memory rather than contiguous.
> Much, much easier to work with a `struct` on a row-by-row basis of course.
It's easy to use a pointer to a `struct` with a row-by-row store, but in C++ you can write classes that hide the fact that the data is stored column-by-column. That's a bit more work up front, but it can make column-by-column access as easy as row-by-row once you have set it up.
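A rough sketch of that idea (class and field names are invented for illustration): the columns are stored separately, but `operator[]` hands out a small proxy so call sites still read like row access.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class People {
public:
    // A lightweight row view: references into the three column vectors.
    struct RowRef {
        std::string& name;
        int&         age;
        double&      salary;
    };

    RowRef operator[](std::size_t i) {
        return RowRef{names_[i], ages_[i], salaries_[i]};
    }

    std::size_t size() const { return names_.size(); }

    void push_back(std::string name, int age, double salary) {
        names_.push_back(std::move(name));
        ages_.push_back(age);
        salaries_.push_back(salary);
    }

private:
    // Column-by-column storage.
    std::vector<std::string> names_;
    std::vector<int>         ages_;
    std::vector<double>      salaries_;
};

// Usage looks row-oriented even though storage is columnar:
//     people[i].salary *= 1.05;
```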
Also, column-by-column store is often used in the entity-component-system pattern, and there are libraries such as EnTT that make it quite easy to work with.
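For instance, a sketch along the lines of EnTT's registry/view API (the component types here are invented): each component type lives in its own densely packed storage, so iterating a view walks column-like arrays.

```cpp
#include <entt/entt.hpp>

struct Position { float x, y; };
struct Velocity { float dx, dy; };

int main() {
    entt::registry registry;

    for (int i = 0; i < 1000; ++i) {
        const auto entity = registry.create();
        registry.emplace<Position>(entity, float(i), float(i));
        registry.emplace<Velocity>(entity, 0.1f, 0.2f);
    }

    // The view iterates the packed per-component arrays.
    registry.view<Position, Velocity>().each(
        [](Position& pos, const Velocity& vel) {
            pos.x += vel.dx;
            pos.y += vel.dy;
        });
}
```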
> Are those points accurate, and are there any more? Given this, would there be a substantial performance improvement from using an in-memory column-store vs row-store on a read-only dataset, or just a marginal one?
Again, it heavily depends on the size of the dataset and how you want to access it. If you frequently use all columns in a row, then row-by-row store is preferred. If you frequently just use one column, and need to access that column of many consecutive rows, then a column-by-column store is best.
Also, there are hybrid solutions possible. You could have one column on its own, and then all the other columns stored in row-by-row fashion.
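A sketch of such a hybrid (names invented): keep the one frequently scanned column in its own array, and everything else row-by-row in a parallel array.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct HybridTable {
    std::vector<double> price;   // the hot column, scanned constantly

    struct Rest {                // the rest, rarely read together, kept row-wise
        std::int64_t id;
        std::string  name;
        std::string  category;
    };
    std::vector<Rest> rest;      // rest[i] belongs to the same row as price[i]
};

double max_price(const HybridTable& t) {
    double best = 0.0;
    for (double p : t.price) {   // touches only the compact price column
        if (p > best) best = p;
    }
    return best;
}
```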
How you will search in a read-only dataset matters a lot. Is it going to be sorted, or is it more like a hash map? In the former case, you want the index to be as compact as possible, and possibly ordered like a B-tree as Alex Guteniev already mentioned. If it's going to be like a hash map, then you probably want row-by-row.
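For the sorted case, a compact, sorted key column plus binary search already captures much of the benefit (a sketch with invented names; a real B-tree layout would additionally pack keys into node-sized blocks):

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Keys kept sorted in their own compact array; the payload lives
// elsewhere and is addressed by the same index.
struct ReadOnlyIndex {
    std::vector<std::int64_t> keys;     // sorted
    std::vector<double>       payload;  // payload[i] belongs to keys[i]

    std::optional<double> find(std::int64_t key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key) return std::nullopt;
        return payload[it - keys.begin()];
    }
};
```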
CodePudding user response:
For in-memory arrays, this is called AoS vs SoA (array of structs vs struct of arrays).
I think the main advantage of SoA for a read-only database is that searches need to touch a smaller memory range. This is more cache friendly and less prone to page faults.
The amount of improvement depends on how you use the database. There may be a more significant improvement from using a more targeted structure (sorted array, B-tree).