My ultimate goal is to use a fast CSV parser in C++. I have looked at the following libraries:
- https://github.com/ben-strasser/fast-cpp-csv-parser
- https://github.com/vincentlaucsb/csv-parser#integration
- https://github.com/p-ranav/csv2
I have also come across numerous Stack Overflow questions regarding CSV parsing, such as:
- Fastest way to get data from a CSV in C++
- Parse very large CSV files with C++
- Reason behind speed of fread in data.table package in R
My understanding is that the fastest way to parse CSV is to use C++ (obviously), memory mapping, and multi-threading.
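To make the multi-threading part concrete, here is a minimal sketch of how an in-memory (for example memory-mapped) buffer could be cut into per-thread chunks that each start at the beginning of a line. The name `chunk_offsets` is hypothetical and not taken from any of the libraries above, and the sketch assumes no quoted field contains a newline:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: returns offsets {0, ..., buffer.size()} such that every
// chunk [offsets[i], offsets[i + 1]) begins at the start of a line.
// Assumption: no quoted field contains an embedded '\n'.
inline std::vector<std::size_t> chunk_offsets(std::string_view buffer,
                                              std::size_t n_chunks) {
    assert(n_chunks > 0);
    std::vector<std::size_t> offsets{0};
    const std::size_t approx = buffer.size() / n_chunks;
    for (std::size_t i = 1; i < n_chunks; ++i) {
        // Start searching at the ideal split point, never before the last cut.
        const std::size_t from = std::max(i * approx, offsets.back());
        const std::size_t newline = buffer.find('\n', from);
        if (newline == std::string_view::npos) break;
        const std::size_t next = newline + 1;  // first byte of the next line
        if (next >= buffer.size()) break;
        offsets.push_back(next);
    }
    offsets.push_back(buffer.size());
    return offsets;
}
```

Each resulting chunk could then be handed to its own thread (for example with `std::thread`), since no row straddles a chunk boundary under the stated assumption.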
I've tried many of the solutions above, with csv2 coming out the fastest (https://github.com/p-ranav/csv2). But none of these are even close to data.table's fread. I have tried looking through their source code (https://github.com/Rdatatable/data.table) to try and extract the fread implementation in C, but I am struggling to incorporate it into my C++ code.
I believe the relevant files are: dt_stdio.h, fread.c, fread.h, and myomp.h.
I was wondering if there is an easy way to compile the existing data.table solution into my C++ codebase.
I think my best solution so far is using csv2 (https://github.com/p-ranav/csv2). This gives a very fast memory-mapping time, but I am struggling to parse the data quickly enough. Even if I just loop through the rows as in their documentation, my time goes to 2 seconds:
csv2::Reader<csv2::delimiter<','>,
             csv2::quote_character<'"'>,
             csv2::first_row_is_header<true>,
             csv2::trim_policy::trim_whitespace> csv;
if (csv.mmap(file_name)) {
    const auto header = csv.header();
    for (const auto &row : csv) {
        // If I only loop through the rows --> 2 seconds
        for (const auto &cell : row) {
            // If I run both loops, which is probably necessary for parsing --> 17 seconds
            // Do something with the cell value
            // std::string value;
            // cell.read_value(value);
        }
    }
}
EDIT:
I am using G++ 11.2.0 on Windows.
My g++ -O option flag was previously set to 0. Changing it to -O3 improved performance (@Alan Birtles).
Even after changing compiler optimization settings, I get the following results pre-parsing:
| Method | Time to Read w/o Parsing | Time to Read + Parse |
|---|---|---|
| data.table | Not Applicable | 2 seconds |
| csv2_reader | 0.003 seconds | 17 seconds |
| csv2_reader with += 1 in loops | 6 seconds | 17 seconds |
| fastcppcsvparser | 2.5 seconds | 14 seconds |
| csv_parser | 17.5 seconds | not worth running |
Is there a way to get data.table's implementation into C++ without using Rcpp along with RInside?
Latest Question:
I just downloaded one of the benchmark data sets and get the same timing. Maybe I'm misunderstanding something, but adding += 1 to count the rows and columns in the loop slows it down from .001 seconds to 6 seconds, which seems weird. Using cell.read_raw_value then slows it down even further.
So how am I supposed to access this data in C++ once it's in a memory map, without the huge performance loss, similar to whatever R's data.table does?
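For comparison, the kind of access pattern I have in mind is sketched below: fields are exposed as `std::string_view` slices of the buffer and numbers are parsed in place with `std::from_chars`, so no `std::string` is allocated per cell. This is only a sketch, not what data.table or csv2 actually do. The helpers `split_fields` and `sum_column` are hypothetical names, and the code assumes unquoted fields, `,` as the delimiter, and C++17:

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: splits one line into fields without copying.
// Assumption: no quoted fields, so the delimiter never appears inside a field.
inline void split_fields(std::string_view line, char delim,
                         std::vector<std::string_view>& out) {
    out.clear();
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            out.push_back(line.substr(start));
            return;
        }
        out.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
}

// Hypothetical helper: sums one integer column of a CSV buffer that is already
// in memory (for example a memory-mapped file). Cells that are not complete
// integers, such as the header, are skipped.
inline long long sum_column(std::string_view buffer, std::size_t column) {
    std::vector<std::string_view> fields;  // reused for every row
    long long total = 0;
    std::size_t start = 0;
    while (start < buffer.size()) {
        std::size_t end = buffer.find('\n', start);
        if (end == std::string_view::npos) end = buffer.size();
        std::string_view line = buffer.substr(start, end - start);
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        split_fields(line, ',', fields);
        if (column < fields.size()) {
            const std::string_view cell = fields[column];
            const char* const last = cell.data() + cell.size();
            long long value = 0;
            const auto result = std::from_chars(cell.data(), last, value);
            // Accept the cell only if all of it was consumed as a number.
            if (result.ec == std::errc{} && result.ptr == last) total += value;
        }
        start = end + 1;
    }
    return total;
}
```

With a buffer such as `"a,b\n1,2\n3,4\n"`, `sum_column(buffer, 1)` skips the header cell `b` and adds `2` and `4`.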
Chat: https://chat.stackoverflow.com/rooms/242552/c-csv-parsing
CodePudding user response:
The answer to the question "can I call this C code from C++?" is "yes, you can" (unless there is something truly weird going on). You have to avoid name mangling, though.
The trick is this:
extern "C" {
#include "somecode.h"
}
See "Call a C function from C++ code".
But really, C++ should be able to produce a CSV parser that is the same speed as a C one; there is nothing that C can do that C++ cannot.
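As a minimal sketch of the linkage trick in a single file: the guarded declaration is what a C header could contain so that it works from both languages, and the definition stands in for code that would normally live in a `.c` file compiled with gcc. The function `count_commas` is a hypothetical example, not part of data.table:

```cpp
#include <assert.h>
#include <stddef.h>

// What a C header might contain: the guard keeps the declaration valid C,
// and gives it C linkage (no name mangling) when included from C++.
#ifdef __cplusplus
extern "C" {
#endif

size_t count_commas(const char* text);

#ifdef __cplusplus
}
#endif

// Stand-in for the C implementation. It is defined here only to keep the
// sketch in one file; it picks up C linkage from the declaration above.
size_t count_commas(const char* text) {
    assert(text != NULL);
    size_t count = 0;
    for (; *text != '\0'; ++text) {
        if (*text == ',') ++count;
    }
    return count;
}
```

From C++ the function is then called like any other, for example `count_commas("a,b,c")`.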
CodePudding user response:
I have tried the following in C++:
// main.cpp
extern "C" {
#include <fread.h>
}
#include <iostream>
int main(int argc, char* argv[]) {
    return 0;
}
I then use the following to compile and link:
gcc -Iinclude -c -o fread.o fread.c
g++ -Iinclude -c -o main.o main.cpp
g++ -Iinclude -o main.exe main.o fread.o
But I then get the following errors when linking:
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1d0): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1da): undefined reference to `Rprintf'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x42e): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x447): undefined reference to `__halt'
... a lot more things
I have included the relevant files to compile in the include folder in my directory, and started this issue for a potential C implementation:
https://github.com/Rdatatable/data.table/issues/5343
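The undefined symbols suggest that fread.c still expects the R runtime (`Rprintf`) and GNU gettext (`libintl_dgettext`) at link time. One thing I am considering is supplying my own stand-ins with C linkage, as sketched below. The signatures are my assumption based on R's `R_ext/Print.h` and gettext's `dgettext`. I have not verified that this is enough to make fread.c link, and `__halt` plus the remaining symbols would need the same treatment once their declarations are located in the data.table sources:

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>

extern "C" {

// Stand-in for R's Rprintf (assumed signature: void Rprintf(const char *, ...)).
// Writes to stdout instead of the R console.
void Rprintf(const char* format, ...) {
    assert(format != nullptr);
    std::va_list args;
    va_start(args, format);
    std::vfprintf(stdout, format, args);
    va_end(args);
}

// Stand-in for gettext's dgettext (assumed signature:
// char *dgettext(const char *domainname, const char *msgid)).
// Performs no translation and returns the message id unchanged.
char* libintl_dgettext(const char* /*domainname*/, const char* msgid) {
    return const_cast<char*>(msgid);
}

}  // extern "C"
```

These definitions would go in a separate .cpp file that is compiled and linked together with main.o and fread.o.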