std::accumulate vs for loop, raytracing application-CodePudding

This question is based on this video on YouTube made with the purpose of reviewing this project.

In the video, the host is analyzing the project and found out that the following block of code is a cause of performance issues:

std::optional<HitRecord> HittableObjectList::Hit(const Ray &r, float t_min, float t_max) const {
    float closest_so_far = t_max;

    return std::accumulate(begin(objects), end(objects), std::optional<HitRecord>{},
        [&](const auto &temp_value, const auto &object) {
            if(auto temp_hit = object -> Hit(r, t_min, closest_so_far); temp_hit) {
                closest_so_far = temp_hit.value().t;
                return temp_hit;
            }

            return temp_value;
    });
}

I would assume that the std::accumulate function would function similarly to a for loop. Unhappy with the performance hit there (and because, for some reason, the profiler wouldn't profile the lambda code[a limitation, perhaps?]), the reviewer changed the code to this:

std::optional<HitRecord> HittableObjectList::Hit(const Ray &r, float t_min, float t_max) const {
    float closest_so_far = t_max;
    std::optional<HitRecord> record{};

    for(size_t i = 0; i < objects.size(); i  ) {
        const std::shared_ptr<HittableObject> &object = objects[i];

        if(auto temp_hit = object -> Hit(r, t_min, closest_so_far); temp_hit) {
            closest_so_far = temp_hit.value().t;
            record = temp_hit;
        }
    }

    return record;
}

With this change the time to completion went from 7 minutes and 30 seconds to 22 seconds.

My questions are:

Aren't both blocks of code identical? Why does std::accumulate give such enormous penalty here?
Would the performance be better if instead of using autos, using the explicit type?

The reviewer did mention suggestions such as avoiding the use of std::optionals and std:shared_ptrs here due to the amount of calls made and to execute this code on the GPU instead, but for now I'm only interested in those points mentioned earlier.

CodePudding user response：

Desclaimer: I did not run advanced tests, this is just my analysis based on the video and the code.

From what I see in the profiling in the video, the hotspot in accumulate is here:

_Val = _Reduce_op(_Val, *_UFirst);

Since _Reduce_op is just our lambda, and the profiling shows this lambda is not the bottleneck, then it means the only expensive operation here is the copy assignment operator =.

Looking at HitRecord:

struct HitRecord {
  point3 p;
  vec3 normal;
  std::shared_ptr<Material> mat_ptr;
  float t;
  bool front_face;
  ...

We see there is a bunch of stuff including a shared_ptr. Chances are the optimizer would remove the copy when it is not needed if the shared_ptr was not here. Copying a shared_ptr is relatevely expensive in a hot loop because it involves atomic operations.

Note that in the profiled accumulate code, we see that they tried to fix this in c 20 by introducing a move.

#if _HAS_CXX20
        _Val = _Reduce_op(_STD move(_Val), *_UFirst);
#else // ^^^ _HAS_CXX20 ^^^ // vvv !_HAS_CXX20 vvv
        _Val = _Reduce_op(_Val, *_UFirst);
#endif // _HAS_CXX20

Though for this move to work, the compiler would have to properly use the named return value optimization which it does not always do when there are multiple return in a function. You would also have to change the signature of the lambda so that it takes a value or an r-value instead of a reference. Changing from auto to a named type would not fix the issue.