Given a block size N and a vector of integers of length k * N, which can be viewed as k blocks of N integers, I want to create a new vector of length k whose elements are the sums of the blocks of the original vector.

E.g. block size 2, vector {1,2,3,4,5,6} would give a result of {3,7,11}.

E.g. block size 3, vector {0,0,0,1,1,1} would give a result of {0,3}.
A simple approach that works:
std::vector<int> sum_blocks(int block_size, const std::vector<int>& input){
    std::vector<int> ret(input.size() / block_size, 0);
    for (unsigned int i = 0; i < ret.size(); ++i)
    {
        for (unsigned int j = 0; j < block_size; ++j)
            ret[i] += input[block_size * i + j];
    }
    return ret;
}
However, I'd be interested in knowing whether there is a neater or more efficient way of doing this, possibly using the algorithm library.
CodePudding user response:
If you can use the range-v3 library, you could write the function like this:

namespace rv = ranges::views;
namespace rs = ranges;

auto sum_blocks(int block_size, std::vector<int> const & input)
{
    return input
         | rv::chunk(block_size)
         | rv::transform([](auto const & block) {
               return rs::accumulate(block, 0);
           })
         | rs::to<std::vector<int>>;
}
which is quite self-explanatory, and avoids doing any index arithmetic like block_size * i + j, which is error-prone.
CodePudding user response:
Neat is often orthogonal to efficient, if not antagonistic. Many pedal-to-the-metal approaches end up looking incomprehensible compared to an initial straightforward solution; e.g. you could use Duff's device to unroll the inner loop based on your block size, or use vector instructions to speed up the additions.
That being said, your problem falls "neatly" into a stencil pattern: put simply, when each element of the result space is some combination of neighbors in the origin space, you have a stencil. In this case your stencil is a 1-D "window" of size = block_size and the "combination" operation is addition. This is the bread and butter of GPU computing, and depending on the problem size it may be beneficial to treat it as a stencil in the CPU/C++ world as well.
Even though the sizes mentioned in the comments are hardly large enough to benefit from parallelization, here's an example of how to do this:
// A stencil captures the src and result arrays.
// Each element of the result depends on block_size elements of the source.
auto stencil = [&src, &ret, blockSz](size_t start, size_t end){
    for (size_t i(start); i < end; ++i)
    {
        ret[i] = std::accumulate(
            src.begin() + i * blockSz,
            src.begin() + i * blockSz + blockSz, 0);
    }
};
Now having such a utility means that you can divide your work among workers, assigning a range of the result to each one of them. Here's a dummy example of how to do this; note that having a thread pool would be much more efficient than on-the-fly assignment of work via std::async:
auto block_reduce(std::vector<int> src, size_t blockSz, size_t nWorkers)
{
    std::vector<int> ret(src.size() / blockSz, 0);
    auto stencil = [&src, &ret, blockSz](size_t start, size_t end){
        for (size_t i(start); i < end; ++i)
        {
            ret[i] = std::accumulate(
                src.begin() + i * blockSz,
                src.begin() + i * blockSz + blockSz, 0);
        }
    };
    size_t chunk = ret.size() / nWorkers;
    size_t iStart = 0, iEnd = chunk;
    std::vector<std::future<void>> results;
    while (iEnd < ret.size())
    {
        results.push_back(std::async(stencil, iStart, iEnd));
        iStart = iEnd;
        iEnd += chunk;
    }
    stencil(iStart, ret.size()); // the main thread handles the last chunk
    while (!results.empty())
    {
        results.back().wait();
        results.pop_back();
    }
    return ret;
}
Note that "industrial" implementations would mandate careful consideration of corner cases which I haven't handled in the demo (e.g. requested workers > actual data, or a runtime decision on the number of workers).
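For instance, the workers-versus-data corner case can be handled with a single clamp before dividing the work. A minimal sketch (the helper name effective_workers is my own, not part of the answer above):

```cpp
#include <algorithm>  // std::clamp, std::max
#include <cstddef>

// Hypothetical helper: clamp the requested worker count to at least one
// worker (the main thread) and at most one worker per result element, so
// that no idle workers are spawned and chunk sizes stay non-zero.
std::size_t effective_workers(std::size_t requested, std::size_t result_size)
{
    return std::clamp<std::size_t>(requested, 1,
                                   std::max<std::size_t>(result_size, 1));
}
```

Calling this before computing chunk = ret.size() / nWorkers also removes the division-by-zero risk when zero workers are requested.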
CodePudding user response:
I'd first of all like to note that Nikos posted a great answer regarding efficiency. This problem is "embarrassingly" parallel, as you can start summing one block regardless of the state of the summation of the other blocks. Basically, you can sum all blocks independently. So, if performance is of the essence, your GPU is about to sweat.
Anyway, here's a more C++ beginner-friendly answer that uses a single for-loop and the numeric library's accumulate function. Sadly, OP, I couldn't see a simple way to use the algorithm library.
std::vector<int> sum_blocks(int block_size, const std::vector<int> &input){
    std::vector<int> result(input.size() / block_size, 0);
    for(std::size_t i = 0; i < result.size(); i++){
        result[i] = std::accumulate(input.begin() + i*block_size, input.begin() + (i+1)*block_size, 0);
    }
    return result;
}
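For what it's worth, one way to drag the algorithm library into it after all is std::generate_n with a stateful lambda. A sketch, not necessarily neater than the loop above:

```cpp
#include <algorithm>  // std::generate_n
#include <iterator>   // std::back_inserter
#include <numeric>    // std::accumulate
#include <vector>

std::vector<int> sum_blocks(int block_size, const std::vector<int>& input)
{
    std::vector<int> result;
    result.reserve(input.size() / block_size);
    auto it = input.begin();
    // Generate one sum per block; the lambda advances `it` by one block
    // each time it is called.
    std::generate_n(std::back_inserter(result), input.size() / block_size,
                    [&it, block_size]() {
                        auto first = it;
                        it += block_size;
                        return std::accumulate(first, it, 0);
                    });
    return result;
}
```

This relies on std::generate_n invoking the generator once per output element, in order, which the standard guarantees for its sequential overload.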
I've also written a quick example that uses iterators only:
std::vector<int> sum_blocks(int block_size, const std::vector<int> &input){
    std::vector<int> result(input.size() / block_size, 0);
    auto it1 = result.begin();
    auto it2 = input.begin();
    while(it1 != result.end()){
        *it1 = std::accumulate(it2, it2 + block_size, 0);
        std::advance(it1, 1);          // same as it1++;
        std::advance(it2, block_size); // same as it2 += block_size;
    }
    return result;
}
In the example I've used std::advance to keep the code more uniform, but it's identical to the code in the comments.
Once again, this answer is oriented towards the neatness request; it is not the efficient way (especially not compared to parallelization techniques).