What is the efficient way of sorting columnar data-CodePudding

In my program, I need an efficient way of sorting in-memory columnar data.

Let me explain this problem. The data consists of four objects:

(1, 15, ‘apple’), (9, 27, ‘pear’),(7, 38, ‘banana’),(4, 99, ‘orange’)

And the four objects are kept in memory in a columnar, and looks like this:

[1,9,7,4],[15, 27, 38, 99],[‘apple’,’pear’,’banana’,’orange’]

I need to sort this list according to the second column in ascending order and the third in descending order.

In the case of only one columnar, it is a simple sorting problem.

But the case is different when two or more columns exist for in-memory columnar data. The swap function may incur too many overheads during sorting columnar data. I’ve checked several Open-sourced implementations to find the best practice, e.g., Apache Arrow, Presto, TDengine, and other projects. And I found that using index sort is a way that can avoid the overhead introduced by swapping since the index, instead of columnar data, will be swapped.

And I’m wondering is the index sort the most efficient means to handle this problem?

CodePudding user response：

If you want speed then fastest language is C .

You can use std::sort for sorting purposes with parallel execution policy std::execution::par_unseq (which enables multi-threaded multi-core parallel sort).

As you can notice in my code below, I did arg-sort, because you wished for it. But in C it is not really necessary, it is enough to do regular sort, for two reasons.

One is because of cache locality, swapping data elements themselves instead of indices is faster because sorting algorithms are often more or less cache friendly, meaning that swapping/comparing nearby-in-memory elements happens often.

Second is because swapping of elements in std::sort is done through std::swap, which in turn uses std::move, this move function swaps classes like std::string very efficiently by just swapping pointers to data instead of copying data itself.

From above it follows that doing arg-sort instead of regular sort might be even slower.

Following code snippet first creates small example file with several data tuples that you provided. In real case you should remove this file writing code so that you don't overwrite your file. At the end it outputs to new file.

After program finishes see created files data.in and data.out.

Try it online!

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <tuple>
#include <execution>
#include <algorithm>

int main() {
    {
        std::ofstream f("data.in");
        f << R"(
            7, 38, banana
            9, 27, pear
            4, 99, orange
            1, 15, apple
        )";
    }
    std::vector<std::tuple<int, int, std::string>> data;
    {
        std::ifstream f("data.in");
        std::string c;
        while (true) {
            int a = 0, b = 0;
            char comma = 0;
            c.clear();
            f >> a >> comma >> b >> comma >> c;
            if (c.empty() && !f)
                break;
            data.push_back({a, b, c});
        }
    }
    std::vector<size_t> idx(data.size());
    for (size_t i = 0; i < idx.size();   i)
        idx[i] = i;
    std::sort(std::execution::par_unseq, idx.begin(), idx.end(),
        [&data](auto const & i, auto const & j){
            auto const & [_0, x0, y0] = data[i];
            auto const & [_1, x1, y1] = data[j];
            if (x0 < x1)
                return true;
            else if (x0 == x1)
                return y0 > y1;
            else
                return false;
        });
    {
        std::ofstream f("data.out");
        for (size_t i = 0; i < idx.size();   i) {
            auto const & [x, y, z] = data[idx[i]];
            f << x << ", " << y << ", " << z << std::endl;
        }
    }
}

Input:

7, 38, banana
9, 27, pear
4, 99, orange
1, 15, apple

Output:

1, 15, apple
9, 27, pear
7, 38, banana
4, 99, orange

CodePudding user response：

Normally this kind of activities is duty of databases. Database can store data in any sorted order; then no need to sort data every time you want to retrieve them. If you use SQL Server as Database you can create table like:

Create Table TableName (
FirstColumn int not null,
SecondColumn int not null,
ThirdColumn nvarchar(1000) not null, 
Primary Key(SecondColumn ASC, ThirdColumn DESC))

The most important part is specifying clustered index which I choose to be the combination of SecondColumn in ascending order & ThirdColumn in descending order as you requested in the question. Why? To answer the reason you should know the facts, that Clustered Index (Primary Key) specify physical order of data in disk (So already data is in your favorite order sorted). Also you can add more Nonclustered index which covers your query to ensure other queries (which their sort order are not same as physical's order) will retrieve fast enough.

Be aware of poor performance of bad design. So If you are not familiar with Database Design, get help from Database Developers or administrators.