Home > Back-end >  How to define column names while using read.csv.sql on a file without header?
How to define column names while using read.csv.sql on a file without header?

Time:10-29

I often use read.csv function to read large CSV files. The files are without header and therefore by using col.names parameter I define properly the name of the variables in the dataframe that would be created after import.

Today, for the first time, I had to use read.csv.sql which is available in sqldf package. The file to import is very big and I only need certain rows based on a condition in that file. According to the online documentation, the filter has to be defined in the WHERE clause of the SELECT statement. Let's say that I have a column in my file (among other columns) which is user_account and I want to import only rows where the condition user_account = 'Foo' is satisfied. Therefore, I have to write something like

df <- read.csv.sql(
    "my_big_data_file.csv",
    sql = "select * from file where user_account = 'Foo'",
    header = FALSE,
    colClasses = c(... Here I define column types ...),
    sep = "|",
    eol = "\n"
)

Now the problem is, unlike read.csv apparently there is no col.names parameter in read.csv.sql. And given that my file has no header I don't know how to refer to column names. I get an error message as I have written user_account in the WHERE clause of the sql parameter in the above code. R complains that there is no such variable.

So, how can I refer to column names using read.csv.sql for a CSV file without header and at the same time referring to those column names in my filter? Is this even possible?

Thanks in advance

CodePudding user response:

Finally I found the answer in the documentation of read.csv.sql. Instead of colClasses one has to use fields.types by specifying directly data types as they are defined in SQLite and not in R.

field.types: A list whose names are the column names and whose contents are the SQLite types (not the R class names) of the columns. Specifying these types improves how fast it takes. Unless speed is very important this argument is not normally used.

SQLite data types are available here

Therefore I modified my program accordingly :

df_tmp <- read.csv.sql(
    file = input_file_path,
    sql = "
       select
           *
       from
           file
       where trim(lower(user_account)) = 'foo'",
    header = FALSE,
    sep = "|",
    eol = "\n",
    `field.types` = list(
        col1 = c("TEXT"),
        col2 = c("TEXT"),
        user_account = c("TEXT"),
        col4 = c("REAL"),
        col5 = c("REAL")
    ),
    dbname = tempfile(),
    drv = "SQLite"
)

However, at the end I had to convert explicitly via as.numeric one variable that had been converted to character. But the program indicated this by a clear warning message. So at the end, this solution did the job for me.

I hope this might help those who have encountered the same problem.

  • Related