Is there any way to use data.table's fread C implementation in C++?

Time:03-03

My ultimate goal is to use a fast CSV parser in C++. I have looked at the following libraries:

I have also come across numerous Stack Overflow questions regarding CSV parsing, such as:

My understanding is that the fastest way to parse CSV is to use C++ (obviously), memory mapping, and multi-threading.

I've tried many of the solutions above, with csv2 coming out the fastest (https://github.com/p-ranav/csv2).

But none of these are even close to data.table's fread. I have tried looking through their source code (https://github.com/Rdatatable/data.table) to try to extract the fread implementation in C, but I am struggling to incorporate it into my C++ code.

I believe the relevant files are:

  • dt_stdio.h, fread.c, fread.h, and myomp.h

I was wondering if there is an easy way to compile the existing data.table solution into my C++ codebase.

I think my best solution so far is using csv2 (https://github.com/p-ranav/csv2). This gives a very fast memory-mapping time, but I am struggling to parse the data quickly enough. Even if I just loop through the rows as in their documentation, my time goes to 2 seconds.

csv2::Reader<csv2::delimiter<','>,
             csv2::quote_character<'"'>,
             csv2::first_row_is_header<true>,
             csv2::trim_policy::trim_whitespace> csv;

if (csv.mmap(file_name)) {
    const auto header = csv.header();
    for (const auto &row : csv) {
        // If I only loop through the rows: 2 seconds
        for (const auto &cell : row) {
            // If I run both loops, which is probably necessary for parsing: 17 seconds

            // Do something with cell value
            // std::string value;
            // cell.read_value(value);
        }
    }
}

EDIT:

I am using g++ 11.2.0 on Windows.

My g++ -O option flag was previously set to 0. Changing it to -O3 improved performance (@Alan Birtles).

Even after changing compiler optimization settings, I get the following results pre-parsing:

Method                          | Time to Read w/o Parsing | Time to Read + Parse
data.table                      | Not applicable           | 2 seconds
csv2_reader                     | .003 seconds             | 17 seconds
csv2_reader with += 1 in loops  | 6 seconds                | 17 seconds
fastcppcsvparser                | 2.5 seconds              | 14 seconds
csv_parser                      | 17.5 seconds             | not worth running

Is there a way to get data.table's implementation into C++ without using Rcpp along with RInside?

Latest Question:

I just downloaded one of the benchmark data sets and get the same timing. Maybe I'm misunderstanding something, but adding += 1 to count the rows and columns in the loop slows it down from .001 seconds to 6 seconds, which seems weird. Using cell.read_raw_value then slows it down even further.

So how am I supposed to access this data in C++ once it's in a memory map, without the huge performance loss, similar to whatever R's data.table does?
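To make the question concrete, here is a minimal sketch of the kind of access being asked about: walking a buffer that is already in memory (for example a memory-mapped file) with std::string_view and std::from_chars, so that no std::string is allocated per cell. It assumes C++17, comma-separated data with no quoted fields, and an integer column; the names split_fields and sum_int_column are hypothetical and this is not data.table's algorithm.

```cpp
#include <charconv>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <system_error>
#include <vector>

// Split one line into views over the original buffer (no copies).
// Assumption: fields are never quoted, so every delimiter is a real separator.
inline void split_fields(std::string_view line, char delim,
                         std::vector<std::string_view>& out) {
    out.clear();
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = line.find(delim, start);
        if (pos == std::string_view::npos) {
            out.push_back(line.substr(start));
            return;
        }
        out.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
}

// Sum one integer column of a CSV buffer, optionally skipping a header line.
inline long long sum_int_column(std::string_view data, std::size_t col,
                                bool skip_header) {
    long long total = 0;
    std::vector<std::string_view> fields;  // reused for every line
    std::size_t pos = 0;
    bool first = true;
    while (pos < data.size()) {
        const char* begin = data.data() + pos;
        const std::size_t remaining = data.size() - pos;
        // memchr locates the end of the current line without copying.
        const void* nl = std::memchr(begin, '\n', remaining);
        const std::size_t len =
            nl ? static_cast<std::size_t>(static_cast<const char*>(nl) - begin)
               : remaining;
        std::string_view line(begin, len);
        pos += len + 1;
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);  // CRLF
        const bool is_header = first && skip_header;
        first = false;
        if (is_header || line.empty()) continue;
        split_fields(line, ',', fields);
        if (col < fields.size()) {
            long long value = 0;
            const std::string_view f = fields[col];
            // from_chars parses in place; cells that fail to parse are ignored.
            const auto res = std::from_chars(f.data(), f.data() + f.size(), value);
            if (res.ec == std::errc()) total += value;
        }
    }
    return total;
}
```

With a mapped file, the call would look like sum_int_column(std::string_view(ptr, size), 1, true), where ptr and size describe the mapping. Whether this gets close to fread's timings depends on the data and on adding multi-threading, which this sketch leaves out.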

Chat: https://chat.stackoverflow.com/rooms/242552/c-csv-parsing

CodePudding user response:

The answer to "can I call this C code from C++?" is "yes, you can" (unless there is something truly weird going on). You have to avoid name mangling, though.

The trick is this:

extern "C" {
  #include "somecode.h"
}

See: Call a C function from C++ code
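A common variant of the same trick puts the guard inside the header itself, so the header works from both C and C++. The sketch below is a single-file illustration under assumed names: add_ints is hypothetical, and in a real project its definition would live in a .c file compiled with gcc rather than next to the declaration.

```cpp
// Header part: the guard makes the declarations use C linkage when the
// header is included from C++, and is invisible to a C compiler.
#ifdef __cplusplus
extern "C" {
#endif

int add_ints(int a, int b);  // hypothetical C function

#ifdef __cplusplus
}
#endif

// Stand-in for the C implementation. It is defined here with C linkage only
// so that the example fits in one file; normally this would be C code.
extern "C" int add_ints(int a, int b) {
    return a + b;
}
```

C++ code can then call add_ints(2, 3) like any other function, and the linker looks for the unmangled symbol add_ints.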

But really, C++ should be able to produce a CSV parser that is the same speed as a C one; there is nothing that C can do that C++ cannot.

CodePudding user response:

I have tried the following in C++:

// main.cpp
extern "C" {
    #include <fread.h>
}

#include <iostream>
int main(int argc, char* argv[]) {

    return 0;
}

I then use the following to compile and link:

gcc -Iinclude -c -o fread.o fread.c
g++ -Iinclude -c -o main.o main.cpp
g++ -Iinclude -o main.exe main.o fread.o

But I then get the following errors when linking:

x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1d0): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1da): undefined reference to `Rprintf'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x42e): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x447): undefined reference to `__halt'
... a lot more errors

I have included the relevant files to compile in the include folder in my directory. And started this issue for potential C implementation: https://github.com/Rdatatable/data.table/issues/5343
