How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql?

I am a relative beginner to R trying to load and explore a large (7GB) CSV file.

It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).

It's too large to read straight into R, and my searching suggests the sqldf package could be useful. But when I try to read the file in with this code ...

library(sqldf)
library(here)

read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")

I get this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 10 did not have 196 elements

Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE to get around this, but I can't work out how to do the same with the read.csv.sql function. I also can't open the CSV in Excel to inspect it because it's too large.

Does anyone know how to solve this, or whether there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but I can try to learn the basics if helpful).

CodePudding user response:

Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. To analyze the data it contains, I suggest changing your data-access approach to something else, such as:

  • A DBMS, whether embedded (duckdb or RSQLite, lower cost of entry) or a full client-server DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (embedded or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis. See the first sketch below.
  • An Arrow parquet file. These are directly supported by dplyr functions in a lazy fashion: when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load any data; you then build your dplyr mutate/filter/select/summarize pipe (with some limitations), and only when you finally call ... %>% collect() does it load the resulting data into memory. Similar to SQL above in that you work on subsets at a time, but if you're already familiar with dplyr, it is much closer to what you know than learning SQL from scratch. See the second sketch below.

There are ways to get a large CSV file into each of these; below are minimal sketches of both.
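
For the DuckDB route, here is a minimal sketch. I'm assuming the file is tab-separated (as in your read.csv.sql call) and that skipping unparseable rows is acceptable; the column names in the final query (code, product_name, countries_en) are just examples from the Open Food Facts schema, so adjust to taste.

library(DBI)
library(duckdb)

# Open (or create) an on-disk database so the imported data persists
con <- dbConnect(duckdb::duckdb(), dbdir = "openfoodfacts.duckdb")

# Stream the CSV straight into a table without going through R's memory;
# ignore_errors = true tells DuckDB to skip rows that fail to parse
# (such as the problematic line 10 in your error message).
# Note that R expands '\t' to a real tab before DuckDB sees it.
dbExecute(con, "
  CREATE TABLE products AS
  SELECT * FROM read_csv_auto(
    'en.openfoodfacts.org.products.csv',
    delim = '\t', header = true, ignore_errors = true
  )
")

# Pull only the subset you need into R
uk <- dbGetQuery(con, "
  SELECT code, product_name, countries_en
  FROM products
  WHERE countries_en = 'United Kingdom'
")

dbDisconnect(con, shutdown = TRUE)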

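For the Arrow route, a sketch of the one-time conversion to parquet followed by a lazy dplyr query. The format/delimiter arguments are my assumption for a tab-separated file (check ?open_dataset for your arrow version), and countries_en is again an example column. If arrow's CSV reader also chokes on the malformed rows, the DuckDB ignore_errors approach above may be the easier path.

library(arrow)
library(dplyr)

# One-time conversion: scan the tab-separated file lazily, write parquet
csv_ds <- open_dataset(
  "en.openfoodfacts.org.products.csv",
  format = "csv", delimiter = "\t"
)
write_dataset(csv_ds, "off_parquet", format = "parquet")

# From then on, queries are lazy: no data is read until collect()
open_dataset("off_parquet") %>%
  filter(countries_en == "United Kingdom") %>%
  select(code, product_name, countries_en) %>%
  collect()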