I often use read.csv function to read large CSV files. The files are without header and therefore by using col.names parameter I define properly the name of the variables in the dataframe that would be created after import.
Today, for the first time, I had to use read.csv.sql which is available in sqldf package. The file to import is very big and I only need certain rows based on a condition in that file. According to the online documentation, the filter has to be defined in the WHERE
clause of the SELECT
statement. Let's say that I have a column in my file (among other columns) which is user_account
and I want to import only rows where the condition user_account = 'Foo'
is satisfied. Therefore, I have to write something like
df <- read.csv.sql(
"my_big_data_file.csv",
sql = "select * from file where user_account = 'Foo'",
header = FALSE,
colClasses = c(... Here I define column types ...),
sep = "|",
eol = "\n"
)
Now the problem is, unlike read.csv
apparently there is no col.names
parameter in read.csv.sql
. And given that my file has no header I don't know how to refer to column names. I get an error message as I have written user_account
in the WHERE
clause of the sql parameter in the above code. R complains that there is no such variable.
So, how can I refer to column names using read.csv.sql
for a CSV file without header and at the same time referring to those column names in my filter? Is this even possible?
Thanks in advance
CodePudding user response:
Finally I found the answer in the documentation of read.csv.sql
. Instead of colClasses
one has to use fields.types
by specifying directly data types as they are defined in SQLite and not in R.
field.types: A list whose names are the column names and whose contents are the SQLite types (not the R class names) of the columns. Specifying these types improves how fast it takes. Unless speed is very important this argument is not normally used.
SQLite data types are available here
Therefore I modified my program accordingly :
df_tmp <- read.csv.sql(
file = input_file_path,
sql = "
select
*
from
file
where trim(lower(user_account)) = 'foo'",
header = FALSE,
sep = "|",
eol = "\n",
`field.types` = list(
col1 = c("TEXT"),
col2 = c("TEXT"),
user_account = c("TEXT"),
col4 = c("REAL"),
col5 = c("REAL")
),
dbname = tempfile(),
drv = "SQLite"
)
However, at the end I had to convert explicitly via as.numeric
one variable that had been converted to character. But the program indicated this by a clear warning message. So at the end, this solution did the job for me.
I hope this might help those who have encountered the same problem.