Does read.csv.sql put the result dataframe in memory or in the database?


I'm using the sqldf package to import CSV files into R and then produce statistics based on the data loaded into the resulting dataframes. We have a shared lab environment with many users, which means we all share the available RAM on the same server. Although a large amount of RAM is available, given the number of users who are often connected simultaneously, the administrator of the environment recommends importing our files into some database (PostgreSQL, SQLite, etc.) rather than loading everything into memory.

I was checking the documentation of the sqldf package and the read.csv.sql function caught my attention. Here is what the documentation says:

Reads the indicated file into an sql database creating the database if it does not already exist. Then it applies the sql statement returning the result as a data frame. If the database did not exist prior to this statement it is removed.

However, what I don't understand is whether the result returned as a dataframe will be in memory (and therefore in the server's RAM) or whether, like the imported CSV file, it will be in the specified database. If it is in memory, it doesn't meet my requirement, which is to reduce the use of the available shared RAM as much as possible, given the huge size of my CSV files (tens of gigabytes, often more than 100 000 000 lines per file).

Curious to see how this works, I wrote the following program:

library(sqldf)

# Import the pipe-delimited log file through an on-disk SQLite database
df_data <- suppressWarnings(read.csv.sql(
    file = "X:/logs/data.csv",
    sql = "
       select
           nullif(timestamp, '') as timestamp_value,
           nullif(user_account, '') as user_account,
           nullif(country_code, '') as country_code,
           nullif(prefix_value, '') as prefix_value,
           nullif(user_query, '') as user_query,
           nullif(returned_code, '') as returned_code,
           nullif(execution_time, '') as execution_time,
           nullif(output_format, '') as output_format
       from
           file
      ",
    header = FALSE,
    sep = "|",
    eol = "\n",
    `field.types` = list(
        timestamp_value = c("TEXT"),
        user_account = c("TEXT"),
        country_code = c("TEXT"),
        prefix_value = c("TEXT"),
        user_query = c("TEXT"),
        returned_code = c("TEXT"),
        execution_time = c("REAL"),
        output_format = c("TEXT")
    ),
    dbname = "X:/logs/sqlite_tmp.db",
    drv = "SQLite"
))

I ran the above program to import a big CSV file (almost 150 000 000 rows). It took around 30 minutes. During execution, as specified via the dbname parameter in the source code, I saw that an SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows of the file were being imported, this file grew bigger and bigger, which indicated that all rows were indeed being inserted into the database file on disk and not into memory (the server's RAM). By the end of the import, the database file had reached 30 GB. As stated in the documentation, at the end of the import process this database was removed automatically. Yet after the created SQLite database had been automatically removed, I was still able to work with the result dataframe (that is, df_data in the above code).

My understanding is that the returned dataframe was in the server's RAM, otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to keep everything, even the result dataframe, in a database and use the RAM only for calculations. Is there any way to keep everything in the database until the end of the program?
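As a rough sanity check on this, measuring the size of the returned object should show that the whole result is held in R's memory (object.size comes with base R's utils package; the in-memory size may of course differ from the 30 GB on-disk database file):

# Approximate amount of RAM occupied by the returned dataframe
print(object.size(df_data), units = "auto")

# Number of rows held in R's memory
nrow(df_data)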

CodePudding user response:

The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it, you can use dbWriteTable from RSQLite directly; however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it and finally check that the table is there. Ignore the read.csv.sql warning. If you add the verbose = TRUE argument to read.csv.sql, it will show the RSQLite statements it is using.
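For the first option, a minimal sketch using RSQLite directly might look like the following. The file name and table name are the same placeholders used in the sqldf code further down; recent RSQLite versions accept a file name as the value argument of dbWriteTable, and the sep/header settings assume the pipe-delimited, headerless file from the question:

library(RSQLite)

# Connect to (or create) an on-disk SQLite database
con <- dbConnect(SQLite(), dbname = "mydb")

# Bulk-load the file straight into a table on disk; no full data frame is built in R
dbWriteTable(con, "mytab", "myfile.csv", sep = "|", header = FALSE)

dbGetQuery(con, "select count(*) from mytab")

dbDisconnect(con)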

Also, you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/ on improving bulk insert speed in SQLite.
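Techniques along the lines discussed there, such as batching the load inside a single transaction and relaxing synchronous writes while bulk inserting, can be applied from R through DBI. A rough, illustrative sketch only (settings are examples, not a recommendation):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "mydb")

# Illustrative bulk-load tuning: relax durability and wrap the work in one transaction
dbExecute(con, "PRAGMA synchronous = OFF")
dbBegin(con)
# ... bulk inserts / dbWriteTable(..., append = TRUE) calls go here ...
dbCommit(con)

dbDisconnect(con)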

library(sqldf)

sqldf("attach 'mydb' as new")

read.csv.sql("myfile.csv", sql = 
  "create table mytab as select * from file", dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
##   SQL statements must be issued with dbExecute() or 
##   dbSendStatement() instead of dbGetQuery() or dbSendQuery().

sqldf("select * from sqlite_master", dbname = "mydb")
##    type  name tbl_name rootpage                                     
## .. info on table that was created in mydb ...

sqldf("select count(*) from mytab", dbname = "mydb")