Character string limit reached early using fread from data.table, CSV lines look fine

I'm reading a moderately large CSV with fread, but it fails with "R character strings are limited to 2^31-1 bytes". readLines, however, reads the file without problems. I narrowed the failure down to a single row (fread succeeds with nrows = 2955 and fails with nrows = 2956), but I can't see what is wrong with it: the line doesn't look any longer than the next one, for example.

Any thoughts about what is going on and how to overcome this, preferably checking the CSV in advance to avoid an fread error?

library(data.table)
options(timeout=10000)

download.file("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2010-03.csv",
              destfile = "trip_data.csv", mode = "wb")

dt = fread("trip_data.csv")
#> Error in fread("trip_data.csv"): R character strings are limited to 2^31-1 bytes

lines = readLines("trip_data.csv")

dt2955 = fread("trip_data.csv", nrows = 2955)
#> Warning in fread("trip_data.csv", nrows = 2955): Previous fread() session was
#> not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
dt2956 = fread("trip_data.csv", nrows = 2956)
#> Error in fread("trip_data.csv", nrows = 2956): R character strings are limited to 2^31-1 bytes
lines[2955]
#> [1] "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"
lines[2956]
#> [1] "CMT,2010-03-07 22:59:01,2010-03-07 23:01:04,1,0.59999999999999998,-73.992887999999994,40.703017000000003,1,0,-73.992887999999994,40.703017000000003,Cre,3.7000000000000002,0.5,0.5,2,0,6.7000000000000002"

Created on 2022-02-12 by the reprex package (v2.0.1)

Answer:

When I tried to read part of the file (around 100k rows) in 500-row chunks, I got:

Warning message:
In fread("trip_data.csv", skip = i * 500, nrows = 500) :
  Stopped early on line 2958. Expected 18 fields but found 19. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9>>
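The warning can be checked directly against the quoted line. As a small sketch (base R only; the variable names are illustrative, and the well-formed line is the one the question printed as lines[2955]), splitting both lines on commas shows where the extra field appears to come from:

```r
# Line quoted in the fread warning above.
bad_line <- "CMT,2010-03-07 03:46:42,2010-03-07 03:58:31,1,3.6000000000000001,-73.961027000000001,40.796674000000003,1,,,-73.937324000000004,40.839283000000002,Cas,10.9,0.5,0.5,0,0,11.9"
# Line printed as lines[2955] in the question.
good_line <- "CMT,2010-03-07 18:37:05,2010-03-07 18:41:51,1,1,-73.984211000000002,40.743720000000003,1,0,-73.974515999999994,40.748331,Cre,4.9000000000000004,0,0.5,1.0800000000000001,0,6.4800000000000004"

# Split on literal commas (assumes no quoted fields contain a comma).
bad_fields  <- strsplit(bad_line,  ",", fixed = TRUE)[[1]]
good_fields <- strsplit(good_line, ",", fixed = TRUE)[[1]]

print(length(good_fields))      # field count of the well-formed line
print(length(bad_fields))       # field count of the discarded line
print(which(bad_fields == ""))  # positions of the empty fields
```

The discarded line has two consecutive empty fields (positions 9 and 10) where the well-formed line has a single value, which is consistent with one stray comma in that row.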

After removing that malformed line, I was able to read at least 100k rows, 500 rows at a time:

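library(data.table)

# Read the file in 500-row chunks. fill = TRUE pads short rows
# instead of stopping at the first row with a different field count.
chunks = vector("list", 201)
for (i in 0:200) {
  print(i*500)
  chunks[[i + 1]] = fread("trip_data.csv", skip = i*500, nrows = 500, fill = TRUE)
}
# Bind by position: only the first chunk carries the real column names.
da = rbindlist(chunks, use.names = FALSE)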

str(da)
Classes ‘data.table’ and 'data.frame':  101000 obs. of  18 variables:
 $ vendor_id         : chr  "" "CMT" "CMT" "CMT" ...
 $ pickup_datetime   : POSIXct, format: NA "2010-03-22 17:05:03" "2010-03-22 19:24:29" ...
 $ dropoff_datetime  : POSIXct, format: NA "2010-03-22 17:22:51" "2010-03-22 19:40:13" ...
 $ passenger_count   : int  NA 1 1 1 3 1 1 1 1 1 ...
[...]

As for checking the CSV in advance to avoid an fread error: you can read the file line by line, split each line into fields, check the length of the resulting list, and bind only the well-formed rows into a data.table.
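One way to do such a pre-check, as a rough sketch: count the separators on each line and flag the lines whose field count differs from the header's. Here `find_bad_lines` is a hypothetical helper, not part of data.table; it assumes that no quoted field contains a comma, and the sample lines are made up for illustration.

```r
# Hypothetical helper: return the indices of lines whose number of fields
# differs from that of the first (header) line.
# Assumption: the separator never appears inside a quoted field.
find_bad_lines <- function(lines, sep = ",") {
  # Count separator occurrences per line; fields = separators + 1.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  n_fields <- n_sep + 1L
  which(n_fields != n_fields[1])
}

# Made-up sample; for the real file this would be readLines("trip_data.csv").
sample_lines <- c("a,b,c", "1,2,3", "4,,,5", "6,7,")
bad <- find_bad_lines(sample_lines)
print(bad)

# Keep only the well-formed lines (safe even when 'bad' is empty).
clean_lines <- sample_lines[setdiff(seq_along(sample_lines), bad)]
print(clean_lines)
```

For the real file, the same helper could be applied to the output of readLines("trip_data.csv"), and the remaining lines passed to fread(text = ...), memory permitting.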

Regards, Grzegorz