I have a PCIe endpoint (EP) device connected to the host. The endpoint's 512MB BAR is mmapped and memcpy is used to transfer data. Memcpy is quite slow (~2.5s). When I map only 100 bytes of the BAR but still run memcpy for the full 512MB, I get a segfault within 0.5s. However, when I read back the end of the BAR, it shows the correct data, i.e. the data reads the same as if I had mmapped the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, 536870912);
if(munmap(map_base, map_size) == -1)
{
PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem: https://github.com/billfarrow/pcimem
Edit: I was being dumb: while consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me with useful information.
Answer:
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
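A related detail that makes the short-mapping test misleading: mmap works in whole pages, so a 100-byte request still maps one full page, and memcpy keeps going until it touches the first address that is not mapped writable. Here is a minimal sketch of that rounding, using an anonymous mapping as a stand-in for the BAR. The helper names usable_mapping_bytes and touch_last_usable_byte are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: how many bytes are really accessible when `len` bytes
// are requested from mmap. The kernel maps whole pages, so the answer is
// `len` rounded up to the page size.
size_t
usable_mapping_bytes(size_t len)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);

    return ((len + page - 1) / page) * page;
}

// Map `len` bytes of anonymous memory (a stand-in for the BAR) and write to
// the last byte of the rounded-up range.
// Returns 0 on success, -1 on failure.
int
touch_last_usable_byte(size_t len)
{
    size_t usable = usable_mapping_bytes(len);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ok;

    if (p == MAP_FAILED)
        return -1;

    // still inside the mapped page, even though it is past `len`
    p[usable - 1] = 0xA5;
    ok = (p[usable - 1] == 0xA5);

    munmap(p, len);
    return ok ? 0 : -1;
}
```

On a system with 4 KiB pages, usable_mapping_bytes(100) is 4096, so the first 4096 bytes of the 512MB memcpy land in the mapping before anything can fault.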
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)
// remove the output file
unlink(filename);
// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT, 0644);
// prevent segfault by providing space in the file
ftruncate(ofile_fd,SIZE);
map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);
// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif
rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);
memcpy(map_base, rand_base, SIZE);
if (munmap(map_base, SIZE) == -1) {
PRINT_ERROR;
}
// close the output file
close(ofile_fd);
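The output-file setup above can also be wrapped in a small helper that checks every return value. This is only a sketch under the same assumption (that filename is an ordinary file), and map_new_file is a made-up name:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: create `path` fresh, size it to `size` bytes, and map
// it read/write. Returns the mapping, or NULL on any failure.
void *
map_new_file(const char *path, size_t size)
{
    void *base;
    int fd;

    // ignore failure: the file may not exist yet
    unlink(path);

    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd == -1)
        return NULL;

    // without this, touching pages past end-of-file raises SIGBUS
    if (ftruncate(fd, (off_t) size) == -1) {
        close(fd);
        return NULL;
    }

    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // the mapping stays valid after the descriptor is closed
    close(fd);

    return (base == MAP_FAILED) ? NULL : base;
}
```

The caller would then do munmap(base, size) with the same size once the copy is finished.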
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned--which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa.
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT.
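A minimal sketch of the ordering that matters here: the feature-test macro has to be defined before the first header is pulled in, otherwise glibc's fcntl.h does not expose O_DIRECT. The helper names are made up for illustration.

```c
// must come before the first #include, or the macro has no effect
#define _GNU_SOURCE
#include <fcntl.h>

// Hypothetical helper: open `path` read-only with O_DIRECT.
// Returns the fd, or -1 with errno set.
int
open_direct_rdonly(const char *path)
{
    return open(path, O_RDONLY | O_DIRECT);
}

// Returns 1 if O_DIRECT is a real (nonzero) flag on this platform
int
have_o_direct(void)
{
    return O_DIRECT != 0;
}
```

Some filesystems do not support O_DIRECT and reject the open with EINVAL, so the return value is worth checking.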
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to. – userYou
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial-and-error and I tried some 27 different combinations before deciding on the optimum config.
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code: the library function and/or the code that the compiler may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
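To make the ioctl idea concrete, here is a rough sketch of what the userspace side of such an interface could look like. Everything in it is hypothetical: the struct layout, the field names, the magic character 'F' and the command number are invented for illustration and do not come from any real driver.

```c
#include <linux/ioctl.h>
#include <stdint.h>

// Hypothetical transfer descriptor passed from userspace to the driver.
// None of these names come from a real driver.
struct fpga_dma_xfer {
    uint64_t user_addr;   // userspace buffer address
    uint64_t dev_offset;  // byte offset into the BAR/device memory
    uint64_t length;      // number of bytes to move
    uint32_t to_device;   // 1 = host-to-device, 0 = device-to-host
    uint32_t reserved;    // pads the struct to a multiple of 8 bytes
};

// Hypothetical ioctl number: magic 'F', command 1, argument read by driver
#define FPGA_IOC_MAGIC 'F'
#define FPGA_IOC_DMA_XFER _IOW(FPGA_IOC_MAGIC, 1, struct fpga_dma_xfer)

// Fill in a descriptor for one transfer
struct fpga_dma_xfer
fpga_dma_xfer_make(const void *buf, uint64_t dev_offset, uint64_t length,
    int to_device)
{
    struct fpga_dma_xfer x;

    x.user_addr = (uint64_t) (uintptr_t) buf;
    x.dev_offset = dev_offset;
    x.length = length;
    x.to_device = to_device ? 1 : 0;
    x.reserved = 0;

    return x;
}
```

Userspace would then issue something like ioctl(fd, FPGA_IOC_DMA_XFER, &x), and the driver would be responsible for pinning the buffer and programming the DMA engine. That driver side is not shown here.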
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>
#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif
#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif
// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif
typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;
// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst,qmemcpy_pc src,size_t size)
{
size_t cnt;
size_t idx;
cnt = size / sizeof(qmemcpy_t);
size = cnt * sizeof(qmemcpy_t);
if (OPT_QCPYIDX) {
for (idx = 0; idx < cnt; ++idx)
dst[idx] = src[idx];
}
else {
for (; cnt > 0; --cnt, ++dst, ++src)
*dst = *src;
}
return size;
}
// qmemcpy -- fast memcpy
void
qmemcpy(void *dst,const void *src,size_t size)
{
size_t xlen;
// use fast memcpy for aligned size
if (OPT_QMEMCPY > 0) {
xlen = _qmemcpy(dst,src,size);
src = (const char *) src + xlen;
dst = (char *) dst + xlen;
size -= xlen;
}
// copy remainder with ordinary memcpy
if (size > 0)
memcpy(dst,src,size);
}
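To compare a routine like this against the library memcpy on the actual mapping, a small timing harness helps. The following is only a sketch: copy_fn_t, copy_with_memcpy, time_copy and copy_mb_per_sec are names invented here, and a single run over a device mapping is a rough measurement at best.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

// signature shared by the copy routines being compared
typedef void (*copy_fn_t)(void *dst, const void *src, size_t size);

// adapter so plain memcpy matches copy_fn_t (memcpy returns void *)
void
copy_with_memcpy(void *dst, const void *src, size_t size)
{
    memcpy(dst, src, size);
}

// Hypothetical helper: run `fn` once over `size` bytes.
// RETURNS: elapsed wall-clock seconds (negative if the clock call fails)
double
time_copy(copy_fn_t fn, void *dst, const void *src, size_t size)
{
    struct timespec t0;
    struct timespec t1;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1)
        return -1.0;
    fn(dst, src, size);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) == -1)
        return -1.0;

    return (double) (t1.tv_sec - t0.tv_sec) +
        (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

// throughput in MB/s (1 MB = 1e6 bytes); 0 if the interval was too short
double
copy_mb_per_sec(size_t size, double seconds)
{
    return (seconds > 0.0) ? ((double) size / 1e6) / seconds : 0.0;
}
```

For example, time_copy(qmemcpy, map_base, rand_base, SIZE) and time_copy(copy_with_memcpy, map_base, rand_base, SIZE) can be compared directly. For reference, the 2.5s for 512MB reported in the question works out to roughly 215 MB/s.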
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of cuda.
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part. – userYou
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
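If the driver does implement write by programming the DMA engine (an assumption that depends entirely on the driver in use), the userspace side could be as simple as a pwrite loop on the device node. A minimal sketch, with the made-up helper names pwrite_all and write_buffer_to:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: write `len` bytes from `buf` to `fd` starting at
// offset `off`, retrying on short writes and EINTR.
// RETURNS: 0 on success, -1 on error (errno is set)
int
pwrite_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        // no progress: give up rather than spin forever
        if (n == 0) {
            errno = EIO;
            return -1;
        }

        p += n;
        off += n;
        len -= (size_t) n;
    }

    return 0;
}

// Hypothetical helper: open `path` (device node or file), write the buffer
// at offset `off`, and close it again.
// RETURNS: 0 on success, -1 on error
int
write_buffer_to(const char *path, const void *buf, size_t len, off_t off)
{
    int fd = open(path, O_WRONLY);
    int rc;

    if (fd == -1)
        return -1;

    rc = pwrite_all(fd, buf, len, off);
    close(fd);

    return rc;
}
```

Whether this ends up as a DMA transfer or as the same programmed I/O as before is decided by the driver behind the device node, so it needs to be measured rather than assumed.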
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory. A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form) would read in bytes from the firmware/microcode .xsvf file using (e.g.) fscanf(fi,"%c",&myint) to get the bytes. This was horrible. I refactored the utility to fix that and the processing of the state machine, and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.