I'm reading a moderately large CSV with fread, but it errors out with "R character strings are limited to 2^31-1 bytes". readLines works fine, however.
I pinned down the faulty line (2965), but I'm not sure what's going on: it doesn't seem any longer than the next one, for example.
Any thoughts on what is going on and how to overcome it, preferably by checking the CSV in advance to avoid an fread error?
library(data.table)
options(timeout=10000)
download.file("https://s3.amazonaws.com/nyc-tlc/trip data/yellow_tripdata_2010-03.csv",
destfile = "trip_data.csv", mode = "wb")
dt = fread("trip_data.csv")
#> Error in fread("trip_data.csv"): R character strings are limited to 2^31-1 bytes
lines = readLines("trip_data.csv")
dt2955 = fread("trip_data.csv", nrows = 2955)
#> Warning in fread("trip_data.csv", nrows = 2955): Previous fread() session was
#> not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
dt2956 = fread("trip_data.csv", nrows = 2956)
#> Error in fread("trip_data.csv", nrows = 2956): R character strings are limited to 2^31-1 bytes
lines[2955]
#> [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"
lines[2956]
#> [1] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"
Created on 2022-02-12 by the reprex package (v2.0.1)
CodePudding user response:
When trying to read part of the file (around 100k rows) I got:
Warning message:
In fread("trip_data.csv", skip = i * 500, nrows = 500) :
Stopped early on line 2958. Expected 18 fields but found 19. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9>>
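After removing that line, I was able to read at least 100k rows:
# Read the file in 500-row chunks and append each chunk to da.
da = data.table(check.names = FALSE)
for (i in 0:200) {
  print(i*500)  # progress: number of lines skipped before this chunk
  # fill = TRUE implicitly fills blank fields when rows have unequal length
  dt = fread("trip_data.csv", skip = i*500, nrows = 500, fill = TRUE)
  # use.names = FALSE binds columns by position rather than by name
  da = rbind(da, dt, use.names = FALSE)
}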
str(da)
Classes ‘data.table’ and 'data.frame': 101000 obs. of 18 variables:
$ vendor_id : chr "" "CMT" "CMT" "CMT" ...
$ pickup_datetime : POSIXct, format: NA "2010-03-22 17:05:03" "2010-03-22 19:24:29" ...
$ dropoff_datetime : POSIXct, format: NA "2010-03-22 17:22:51" "2010-03-22 19:40:13" ...
$ passenger_count : int NA 1 1 1 3 1 1 1 1 1 ...
[...]
As for checking the CSV in advance to avoid an fread error: you can read the file line by line, split each line into fields, check the length of the resulting list, and bind only the lines with the expected number of fields to a data.table.
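A minimal sketch of such a pre-check, not tested against the full taxi file. It assumes that no field contains a quoted comma, so counting commas is enough to count fields. The helper name count_fields and the small stand-in file are made up for illustration; replace tmp with "trip_data.csv" to try it on the real data.

```r
library(data.table)

# Hypothetical helper: number of comma-separated fields in each line.
# Assumption: no quoted fields containing commas.
count_fields <- function(lines) {
  lengths(regmatches(lines, gregexpr(",", lines, fixed = TRUE))) + 1L
}

# Small stand-in file; its third line has too many fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,,,5,6", "7,8,9"), tmp)

lines <- readLines(tmp)
n_fields <- count_fields(lines)
expected <- n_fields[1]              # field count of the header line
bad <- which(n_fields != expected)   # line numbers with a different field count

# Parse only the well-formed lines, directly from memory.
dt <- fread(text = lines[n_fields == expected])
```

Here bad holds the line numbers to inspect before deciding whether to drop or repair them. Blank lines count as one field, so they are flagged as well.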
Regards, Grzegorz