This particular issue came up while profiling an HPC code. Here is a minimal working example.
I have a source data vector that I copy into a large destination vector, and in another place I copy the same source into a destination vector of the same size. What I found is that copying into the large destination vector takes longer (~2.5x), even though the amount of data being copied is the same. I need some help understanding this behavior, and also some guidance on how to improve performance when copying into the large destination vector.
#include <iostream>
#include <vector>
#include <memory>
#include <chrono>
#include <cstring>
#include <cstdlib>
#include <cstdint>

int main() {
    size_t tw = 1024, th = 1024, td = 10;

    std::vector<uint16_t> src_data(tw * th);
    for (size_t i = 0; i < tw * th; ++i) { src_data[i] = rand() % 10; }

    // Copy the source tile into td different offsets of a large destination vector.
    std::shared_ptr<std::vector<uint16_t>> large_tile_data_1 =
        std::make_shared<std::vector<uint16_t>>(tw * th * td);
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < td; ++i) {
        auto offset = i * tw * th;
        std::memcpy((void *)(large_tile_data_1->data() + offset),
                    (void *)(&src_data[0]), sizeof(uint16_t) * tw * th);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "elapsed_time " << elapsed_seconds.count() << std::endl;

    // Copy the source tile td times into a destination vector of the same size.
    std::shared_ptr<std::vector<uint16_t>> small_tile_data_2 =
        std::make_shared<std::vector<uint16_t>>(tw * th);
    start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < td; ++i) {
        auto offset = 0;
        std::memcpy((void *)(small_tile_data_2->data() + offset),
                    (void *)(&src_data[0]), sizeof(uint16_t) * tw * th);
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed_seconds = end - start;
    std::cout << "elapsed_time " << elapsed_seconds.count() << std::endl;
}
CodePudding user response:
The answer is almost certainly caching of some form; either CPU data cache, or TLB, or memory allocation behavior.
In your test code, the second part overwrites the same 2 MiB of data over and over again. That size fits nicely into the L3 cache of most CPUs. And even if it doesn't, a modern LLC (last-level cache) may be smart enough to switch to an adaptive replacement strategy that works well for this case.
The repeated overwriting, and the smaller working set in general, also helps in other ways. For example, it increases the chance that the destination memory is already in the TLB (translation lookaside buffer).
Another aspect is that in the first case you allocate several megabytes of memory and write each page only once, instead of repeatedly. Such a large allocation is likely backed by a separate memory mapping; its pages are initialized lazily by the operating system and incur a high overhead on first touch. In the second, smaller case this cost is paid only once, during the first loop iteration.
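If lazy page initialization is indeed what dominates, one way to test it (and to move the cost out of the measurement) is to touch the large destination once before starting the clock. This is only a sketch; it assumes 4 KiB pages, and prefault is a hypothetical helper, not part of the original code:
#include <cstddef>
#include <cstdint>
#include <vector>

// Touch one element per (assumed 4 KiB) page so the OS maps every page of the
// vector before the timed memcpy loop runs.
void prefault(std::vector<uint16_t> &v) {
    const std::size_t step = 4096 / sizeof(uint16_t); // elements per page
    for (std::size_t i = 0; i < v.size(); i += step)
        v[i] = 0;
}
Called as prefault(*large_tile_data_1); before taking the first timestamp, it should shrink the gap if page faults are the cause.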
Likewise, in a real program, smaller allocations have a higher chance of being cached by the malloc implementation instead of being released to the operating system, which makes reuse in repeated allocation-deallocation cycles faster.
CodePudding user response:
Since you are copying the same amount of data, I would say cache effects are less likely. But profile your code and check the number of cache misses to make sure that isn't what you are seeing.
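For example, on Linux you could run something like this (assuming perf is installed; ./copy_benchmark stands in for your compiled test program, and the exact event names vary by CPU):
perf stat -e cache-references,cache-misses,dTLB-load-misses ./copy_benchmark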
Let's look at another candidate for your slowdown.
In most C libraries, memcpy() has specialized code for copying large, suitably aligned blocks, plus a fallback for the generic case. For example, on ARM you can load or store a double word (2 registers) in a single instruction if the memory is suitably aligned. On x86 you can leverage SSE registers to load 16, 32 or even 64 bytes at a time, and both source and destination need to be properly aligned for that to be fast.
So if your data is 16-byte aligned for both source and destination, you get one code path to copy the data; if not, you get a different one (or the CPU is simply slower doing it). Memory allocated via malloc() or new is generally 16-byte aligned (aligned to the architecture's maximum alignment requirement for any fundamental type) and will be optimal for memcpy(). But if you then copy data at an offset into the vector, you can destroy that optimal alignment and get a slower memcpy(), either because the CPU has a harder time or because memcpy() has to use less efficient code to realign the data on the fly.
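To check whether that applies here, you could print the alignment of both pointers right before each memcpy(). This is only a sketch, and report_alignment is a helper made up for illustration:
#include <cstdint>
#include <cstdio>

// Print the remainder of a pointer's address modulo 16 and 64 bytes;
// a remainder of 0 means the pointer is at least that strongly aligned.
static void report_alignment(const char *name, const void *p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::printf("%s: addr %% 16 = %u, addr %% 64 = %u\n",
                name, (unsigned)(addr % 16), (unsigned)(addr % 64));
}
Calling report_alignment("dst", large_tile_data_1->data() + offset); and report_alignment("src", src_data.data()); inside the loop would show whether the two loops differ in alignment.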
CodePudding user response:
The problem is where you are focusing. Your code is indeed slower in the first memcpy loop than in the second, but the size of the destination is not the only difference: those two loops don't do the same thing. If you make them do the same thing, they take the same amount of time regardless of the size of the destination.
Make the offset in the first loop the same as in the second one and try again.
I tried for you:
Offset = 0 for the large vector too
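In code, the change amounts to this (a sketch; everything else in the program stays the same):
for (size_t i = 0; i < td; ++i) {
    auto offset = 0; // same destination every iteration, as in the second loop
    std::memcpy((void *)(large_tile_data_1->data() + offset),
                (void *)(&src_data[0]), sizeof(uint16_t) * tw * th);
}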
These are just examples; use something more reliable than godbolt to benchmark the code.
Other answers and comments already explained what can cause this behavior. Optimizations, cache misses and those other cool words.