I use the following (pseudo) code to import huge CSV files in R when only a rather small portion of the data in the file has to be imported and all other rows should simply be ignored. I don't load the data into memory; instead, it goes into a SQLite database that the program creates.
library(sqldf)
suppressWarnings(read.csv.sql(
  file = input_file_path,
  sql = "
    create table mytable as
    select
      . . .
    from
      file
    where
      . . .
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  field.types = list( . . . ),
  dbname = sqlite_db_path,
  drv = "SQLite"
))
If I understand the documentation of read.csv.sql correctly, the WHERE clause in the CREATE TABLE statement above guarantees that only rows satisfying its condition will be inserted into the created table mytable.

However, last night during a test I noticed a somewhat strange behaviour in my program. I had to import a big CSV file (more than 100 000 000 rows), but only a set of 10 000 rows was the target and all other rows had to be ignored. The above program did the job, and at the end the created table mytable had only 10 000 rows as expected, which proves that the condition in the WHERE clause was taken into account. However, when I checked the size of the created SQLite database file, it was abnormally huge (more than 25 GB). That cannot possibly be the size of 10 000 rows.

Therefore I'm confused. What happened during the import process? Why did the SQLite database file become so huge despite the fact that only 10 000 rows were inserted into mytable and everything else was ignored? Is this the normal behaviour, or am I not using read.csv.sql correctly?
CodePudding user response:
edit: this is an elaboration on the suggestions @G.Grothendieck makes in the comment section above.
--
I do not have experience with read.csv.sql, but you might want to take a look at importing the output of a shell command. That way you preprocess the CSV and import only the result.

For example, data.table::fread(cmd = 'findstr "^[^#]" test.csv', sep = "\n", header = FALSE) imports all lines from test.csv that do not start with a # character. In other words, it imports the output of the (Windows) command findstr "^[^#]" test.csv into a data.table, with all the power of data.table::fread() and its arguments. This will also work in *nix environments, but you will have to use the appropriate shell command (such as grep), as sketched below.
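For Linux or macOS, a minimal sketch of the equivalent call, assuming the same hypothetical test.csv, might look like this (the pattern syntax happens to be the same for grep, so only the command name changes):

library(data.table)

# Keep only the lines of test.csv that do not start with a '#' character;
# fread() parses whatever the shell command writes to stdout.
dt <- fread(cmd = 'grep "^[^#]" test.csv', sep = "\n", header = FALSE)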
Note that an approach like this can work very well (and fast!), but it ties your code to OS-specific shell commands, so it will not run unchanged across operating systems.
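If you do need the same script to run on both Windows and *nix, one possibility is to pick the shell command based on .Platform$OS.type. This is just a sketch; read_filtered is a hypothetical helper, not part of any package:

library(data.table)

# Hypothetical helper: choose the pre-filtering command for the current OS,
# then hand its output to fread(). The default pattern keeps lines that do
# not start with '#', as in the example above.
read_filtered <- function(path, pattern = "^[^#]") {
  cmd <- if (.Platform$OS.type == "windows") {
    sprintf('findstr "%s" %s', pattern, path)
  } else {
    sprintf("grep '%s' %s", pattern, path)
  }
  fread(cmd = cmd, sep = "\n", header = FALSE)
}

You still depend on grep/findstr being installed, but the R code itself no longer changes between platforms.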
CodePudding user response:
Thanks to @G. Grothendieck for his clarification:

read.csv.sql first uses dbWriteTable to write the whole file to the database, and only then operates on it. You could consider using grep/findstr/xsv to extract the rows of interest and pipe them directly into R, e.g. read.table(pipe("grep ..."), sep = ","); on Windows, use findstr, or the xsv or another CSV utility, instead of grep.
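A minimal sketch of that suggestion, with the pattern and file path as placeholders (on Windows you would swap grep for findstr or xsv): the shell filters the huge file down to the matching lines, and R only ever reads those rows, so a SQLite copy, if you still need one, stays small.

# Filter in the shell, read only the matching rows into R.
# "PATTERN" and the path are placeholders for the real condition and file.
subset_df <- read.table(
  pipe('grep "PATTERN" /path/to/huge_file.csv'),
  sep = "|",
  header = FALSE,
  stringsAsFactors = FALSE
)

# If a SQLite table is still wanted, only the filtered rows get written
# (sketch using DBI/RSQLite):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), sqlite_db_path)
dbWriteTable(con, "mytable", subset_df)
dbDisconnect(con)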