How to optimize omp parallelization when batching

Time:03-03

I am generating class objects and putting them into a std::vector. Before adding one, I need to check whether it intersects any of the already generated objects. As I plan to have millions of them, I need to parallelize this function, because it takes a lot of time (the function must check each new object against all previously generated ones).

Unfortunately, the speed increase is not significant, and the profiler shows very low efficiency (almost all overhead). Any advice would be appreciated.

    bool
    Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
    {
        auto ptr_cube = &cube;
        auto npol = cubes.size();
        auto ptr_cubes = cubes.data();

        const auto nthreads = omp_get_max_threads();

        bool check = false;

#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
        {
#pragma omp single nowait
            {
                const auto batch_size = npol / nthreads;
                for (int32_t i = 0; i < nthreads; i++)
                {
                    const auto bstart = batch_size * i;
                    const auto bend = ((bstart + batch_size) > npol) ? npol : bstart + batch_size;

#pragma omp task firstprivate(i, bstart, bend) shared (check)
                    {
                        struct bd bd1{}, bd2{};
                        bd1 = allocate_bd();
                        bd2 = allocate_bd();

                        for (auto j = bstart; j < bend; j++)
                        {
                            bool loc_check;
#pragma omp atomic read
                            loc_check = check;
                            if (loc_check) break;

                            if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
                            {
#pragma omp atomic write
                                check = true;
                                break;
                            }
                        }
                        free_bd(&bd1);
                        free_bd(&bd2);
                    }
                }
            }
        }
        return check;
    }

UPDATE: The Cube is actually made of smaller objects Cuboids, each of them have size (L, W, H), position coordinates and rotation. The intersect function:

    bool
    Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
    {
        const auto nom = number_of_cuboids();
        const auto onom = other.number_of_cuboids();

        for (int32_t i = 0; i < nom; i++)
        {
            get_mcoord(i, bd1);

            for (int32_t j = 0; j < onom; j++)
            {
                other.get_mcoord(j, bd2);
                if (check_gjk_intersection(bd1, bd2))
                {
                    return true;
                }
            }
        }
        return false;
    }

// get_mcoord calculates vertices of the cuboids

    void
    Cube::get_mcoord(int32_t index, struct bd *bd) const
    {
        for (int32_t i = 0; i < 8; i++)
        {
            for (int32_t j = 0; j < 3; j++)
            {
                bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
            }
        }
    }

    inline struct bd
    allocate_bd()
    {
        struct bd bd{};

        bd.numpoints = 8;

        bd.coord = (double **) malloc(8 * sizeof(double *));

        for (int32_t i = 0; i < 8; i++)
        {
            bd.coord[i] = (double *) malloc(3 * sizeof(double));
        }
        return bd;
    }

CodePudding user response:

The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, maybe one task will break early, but all the others will still go through their full search.

With recent versions of OpenMP (5, I think) there is a solution for that.

  1. (Not sure about this one: make your tasks much more fine-grained, for instance one for each intersection test);
  2. Spawn your tasks in a taskloop;
  3. Once you find your intersection (or whatever condition causes you to break), do `cancel taskloop`;
  4. Small problem: cancellation is disabled by default. Set the environment variable OMP_CANCELLATION to true.
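Points 2–4 can be sketched as follows. This is a minimal, hypothetical example: `intersects` stands in for the real `cube_intersecting` test, and note that in the OpenMP spec a taskloop is cancelled through its (implicit) taskgroup region, so the construct is `cancel taskgroup`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real intersection predicate.
static bool intersects(int a, int b) { return a == b; }

// Early-exit search with taskloop + cancellation. Run with
// OMP_CANCELLATION=true; otherwise `cancel` is a no-op and every
// task finishes its chunk (still correct, just no early exit).
bool any_hit_taskloop(const std::vector<int> &cubes, int probe)
{
    bool found = false;
#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(64) shared(found)
    for (std::size_t j = 0; j < cubes.size(); ++j)
    {
        if (intersects(cubes[j], probe))
        {
#pragma omp atomic write
            found = true;
            // A taskloop is cancelled via its enclosing implicit taskgroup.
#pragma omp cancel taskgroup
        }
        // Give long-running sibling tasks a spot to notice the cancel.
#pragma omp cancellation point taskgroup
    }
    return found;
}
```

Without OpenMP the pragmas are ignored and the function degrades to a plain sequential scan, so it stays correct either way.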

CodePudding user response:

Do you have more intersections being true or more being false? If most are true, you are flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid the shared resource altogether, so there is no mutex: let all threads run to completion, and at the end combine their results into a decision. This will likely run faster, but the benefit also depends on a few parameters (e.g., nthreads, ncuboids).
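A minimal sketch of this "no shared resource" idea using OpenMP's `||` reduction (the `intersects` predicate is a hypothetical stand-in for `cube_intersecting`): each thread ORs results into a private copy of the flag, and the copies are combined exactly once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical predicate standing in for the real intersection test.
static bool intersects(int a, int b) { return a == b; }

// Every thread accumulates into its own private copy of `hit`; the
// copies are OR-ed together once when the parallel loop ends, so there
// are no atomics or mutexes on the hot path. The trade-off: no early
// exit -- all iterations run.
bool any_hit_reduction(const std::vector<int> &cubes, int probe)
{
    bool hit = false;
#pragma omp parallel for reduction(||: hit)
    for (std::ptrdiff_t j = 0; j < (std::ptrdiff_t) cubes.size(); ++j)
        hit = hit || intersects(cubes[j], probe);
    return hit;
}
```

This variant pays for the lost early exit with contention-free inner iterations, which is the better deal precisely when most checks come back true.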

It is possible that on another architecture (e.g., a GPU) your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).

You also have a complexity problem: for every new cuboid you compare against up to the whole set of existing cuboids. One way to address this is to gather all the cuboid extents (ranges) per dimension, keep them ordered, and insert each new cuboid's ranges in order. If there is an intersection in one dimension, you test the next one, and so on. You can also run the dimensions in parallel. Before walking through the ranges, test whether you even hit the global range; if not, it is useless to test the local intersections.
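The per-dimension range idea rests on a simple fact: two boxes can intersect only if their [min, max] ranges overlap in every dimension, so a failed overlap in any single dimension rules the pair out before any expensive GJK test. A minimal sketch of that prefilter (the `Aabb` struct is hypothetical, not from the original code):

```cpp
#include <cassert>

// Hypothetical axis-aligned bounding box: one [lo, hi] range per dimension.
struct Aabb {
    double lo[3];
    double hi[3];
};

// Two boxes overlap iff their ranges overlap in all three dimensions;
// the first separated dimension rejects the pair immediately.
inline bool ranges_overlap(const Aabb &a, const Aabb &b)
{
    for (int d = 0; d < 3; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d])
            return false;   // separated along dimension d
    return true;
}
```

Keeping these ranges in sorted order per dimension is what turns the all-pairs scan into a much narrower candidate search.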

Here, and in general, you want to parallelize with a minimum of dependencies (shared resources, mutexes), so you want to find a point of view from which that is possible. Parallelizing over dimensions of ordered ranges (segments) might be better than parallelizing over cuboids.

Algorithms and the benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are irrelevant, but one may find a smarter approach given those values.
