Why is multi-threading of matrix calculation not faster than single-core?-CodePudding

this is my first time using multi-threading to speed up a heavy calculation.

Background: The idea is to calculate a Kernel Covariance matrix, by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().

I already sped up the calculations by only calculating the lower triangluar matrix. Since all the calculations are independent from each other I tried to speed up the process (x_test.size() = 27000 in my case) by splitting the calculations of the matrix entries row-wise, assigning a range of rows to each thread.

On a single core the calculations took about 280 seconds each time, on 4 cores it took 270-290 seconds.

main.cpp

int main(int argc, char *argv[]) {
    double sigma0sq = 1;
    double lengthScale [] = {0.7633, 0.6937, 3.3307e 07};
    
    const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
    
    /* Finding data slices of similar size */
    //This piece of code works, each thread is assigned roughly the same number of matrix entries
    int numElements = x_test.size()*x_test.size()/2;
    const int numThreads = 4;
    int elemsPerThread = numElements / numThreads;
    std::vector<int> indices;

    int j = 0;
    for(std::size_t i=1; i<x_test.size() 1;   i){
        int prod = i*(i 1)/2 - j*(j 1)/2;
        if (prod > elemsPerThread) {
            i--;
            j = i;
            indices.push_back(i);
            if(indices.size() == numThreads-1)
                break;
        }
    }
    indices.insert(indices.begin(), 0);
    indices.push_back(x_test.size());

    /* Spreding calculations to multiple threads */
    std::vector<std::thread> threads;

    for(std::size_t i = 1; i < indices.size();   i){
        threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
    }

    for(auto & th: threads){
        th.join();
    }
    return 0;
}

As you can see, each thread performs the following calculations on the data assigned to it:

void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
    char buffer[8192];

    std::ofstream out("lower_half_matrix_"   std::to_string(threadCounter)  ".csv");
    out.rdbuf()->pubsetbuf(buffer, 8196);

    for(int i = start; i < stop;   i){
        for(int j = 0; j < i 1;   j){
            double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
            if (j!=0)
                out << ',';
            out << kij;
        }
        if(i!=xtest.size()-1 )
            out << '\n';
    }
    out.close();
}

and

double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
    double sum(0);
    for(std::size_t i=0; i<x1.size();i  ){
        sum  = pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
    }
    return sigma0sq*exp(-0.5*sum);
}

Aspects I considered

locking by simultaneous access to data vector -> I don't pass a reference to the threads, but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know this should prevent simultaneous data access since every thread has its own copy
Output -> every thread writes its part of the lower triangular matrix to its own file. My task manager doesn't indicate a full SSD utilization in the slightest

Compiler and machine

Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)

CodePudding user response：

There are many details that can be improved in your code, but I think the two biggest issues are:

using vectors or vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.

The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:

using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;

The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.

Here is a way to do exactly that:

void calculateKMatrixCpp(
    std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
    int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
    // ...
    double& kij = kMatrix[i * xtest.size()   j];
    kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
    // ...
}
// ...
threads.push_back(std::thread(
    calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
    i, indices[i-1], indices[i], std::ref(kMatrix)
));

Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.

Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.

Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.

If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.

#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>

using Point3D = std::array<double, 3>;

auto generateSampleData() -> std::vector<Point3D> {
    static std::minstd_rand g(std::random_device{}());
    std::uniform_real_distribution<> d(-1.0, 1.0);

    std::vector<Point3D> data;
    data.reserve(27000);
    for (auto i = 0; i < 27000;   i) {
        data.push_back({ d(g), d(g), d(g) });
    }
    return data;
}

double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
    double sum = 0.0;
    for (auto i = 0u; i < 3u;   i) {
        double distance = (x1[i] - x2[i]) / lengthScale[i];
        sum  = distance*distance;
    }
    return sigma0sq * std::exp(-0.5*sum);
}

void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
    std::cout << "start of thread " << threadCounter << "\n" << std::flush;
    for(int i = start; i < stop;   i) {
        for(int j = 0; j < i 1;   j) {
            double& kij = kMatrix[i * xtest.size()   j];
            kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
        }
    }
    std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}

int main() {
    double sigma0sq = 1;
    Point3D lengthScale = {0.7633, 0.6937, 3.3307e 07};

    const std::vector<Point3D> x_test = generateSampleData();

    /* Finding data slices of similar size */
    //This piece of code works, each thread is assigned roughly the same number of matrix entries
    int numElements = x_test.size()*x_test.size()/2;
    const int numThreads = 4;
    int elemsPerThread = numElements / numThreads;
    std::vector<int> indices;

    int j = 0;
    for(std::size_t i = 1; i < x_test.size() 1;   i){
        int prod = i*(i 1)/2 - j*(j 1)/2;
        if (prod > elemsPerThread) {
            i--;
            j = i;
            indices.push_back(i);
            if(indices.size() == numThreads-1)
                break;
        }
    }
    indices.insert(indices.begin(), 0);
    indices.push_back(x_test.size());

    auto start = std::chrono::system_clock::now();
    std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);

    std::vector<std::thread> threads;
    for (std::size_t i = 1; i < indices.size();   i) {
        threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
    }

    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;


    start = std::chrono::system_clock::now();
    constexpr int buffer_size = 131072;
    char buffer[buffer_size];
    std::ofstream out("matrix.csv");
    out.rdbuf()->pubsetbuf(buffer, buffer_size);

    for (int i = 0; i < x_test.size();   i) {
        for (int j = 0; j < i   1;   j) {
            if (j != 0) {
                out << ',';
            }
            out << kMatrix[i * x_test.size()   j];
        }

        if (i != x_test.size() - 1) {
            out << '\n';
        }
    }
    end = std::chrono::system_clock::now();
    elapsed_seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}