I am a relative beginner to R trying to load and explore a large (7GB) CSV file.
It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).
It's too large to read straight into R, and my searching made me think the sqldf package could be useful. But when I try to read the file in with this code ...
library(sqldf)
library(here)
read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")
I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 10 did not have 196 elements
Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.
Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).
CodePudding user response:
Given that error and the 7GB file size, it seems unlikely that you will be able to read the CSV file in toto into memory, even once. For analyzing the data within it, I suggest you change your data access to something else, such as one of the following (rough sketches of both follow this list):

- A DBMS, whether monolithic (duckdb or RSQLite, lower cost of entry) or a full client/server DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you connect (using DBI) to the database (monolithic or otherwise), query for the subset of data you want/need, and work on that subset. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis.
- An Arrow parquet file. These are directly supported by dplyr functions in a lazy fashion: when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load any data; you build your dplyr mutate/filter/select/summarize pipe (with some limitations), and only when you finally call ... %>% collect() is the resulting data loaded into memory. Similar to the SQL approach above in that you work on subsets at a time, but if you're already familiar with dplyr, it is much closer to what you already know than learning SQL from scratch.
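As a minimal sketch of both routes (assuming the duckdb, DBI, arrow, and dplyr packages are installed, and that the data has already been imported into a DuckDB file or converted to Parquet, which the next list covers), note that the database, file, and column names here (openfoodfacts.duckdb, products, openfoodfacts_parquet, product_name, brands, countries_en) are placeholders, not necessarily the exact Open Food Facts schema:

# DBMS route: connect with DBI and pull only the subset you need into R.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "openfoodfacts.duckdb")
uk_products <- dbGetQuery(con, "
  SELECT product_name, brands
  FROM products
  WHERE countries_en = 'United Kingdom'
")
dbDisconnect(con, shutdown = TRUE)

# Arrow/Parquet route: a lazy dplyr pipeline; nothing is read until collect().
library(arrow)
library(dplyr)

ds <- open_dataset("openfoodfacts_parquet")        # returns immediately, loads no data
uk_products <- ds %>%
  filter(countries_en == "United Kingdom") %>%     # only some dplyr verbs/functions are translated
  select(product_name, brands) %>%
  collect()                                        # only now is the result read into memory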
There are ways to get a large CSV file into each of these (a sketch follows this list):

- Arrow/Parquet: How to convert a csv file to parquet (python, arrow/drill); a quick search in your favorite search engine should turn up other possibilities. Regardless of the language you want to do your analysis in ("R"), don't constrain yourself to solutions written in that language.
- SQL: DuckDB (https://duckdb.org/docs/data/csv.html), SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/), and other DBMSes tend to have a "bulk" command for importing raw CSV.
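To make that import step concrete, here is a rough sketch of both routes in R, assuming the tab-separated file is in your working directory under the name from the question; the options shown (delim, header, ignore_errors for DuckDB; format and delim for arrow) are what recent versions accept, so check the docs linked above if your versions differ:

# DuckDB bulk import: DuckDB parses the CSV itself; the 7GB file is never loaded into R.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "openfoodfacts.duckdb")
# R expands "\t" to a literal tab inside the SQL string, which DuckDB accepts as the delimiter.
# ignore_errors = TRUE tells DuckDB to skip rows it cannot parse, which may help with the error above.
dbExecute(con, "
  CREATE TABLE products AS
  SELECT * FROM read_csv_auto('en.openfoodfacts.org.products.csv',
                              delim = '\t', header = TRUE, ignore_errors = TRUE)
")
dbDisconnect(con, shutdown = TRUE)

# Arrow: scan the CSV lazily and rewrite it as Parquet, chunk by chunk.
library(arrow)

csv_ds <- open_dataset("en.openfoodfacts.org.products.csv",
                       format = "csv", delim = "\t")
write_dataset(csv_ds, "openfoodfacts_parquet", format = "parquet")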