The following code generates data files in which each row has a different number of columns. With data.table's fread, the option fill=TRUE appears to work only until a certain character limit is reached. For instance, compare the working1.dat and working2.dat examples, both of which work as expected, with the notworking1.dat example. How can I read the entirety of notworking1.dat with fill=TRUE enabled, and not just the first 100 rows?
for (i in seq(1000,1099,by=1))
  cat(file="working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill=TRUE)

for (i in seq(1000,1101,by=1))
  cat(file="notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill=TRUE)

for (i in seq(1,101,by=1))
  cat(file="working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill=TRUE)
Supplying the column names explicitly also fails:
df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))
Warning message received:
Warning message:
In data.table::fread(input = "notworking1.dat", fill = TRUE) :
  Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...
Answer:
We could find the maximum number of columns, write a header row with that many column names to a temporary file, and then fread that file:
library(data.table)
x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")
# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)
# check output
dim(d1)
# [1] 102 1101
d1[100:102, 1101]
# V1101
# 1: NA
# 2: NA
# 3: 1101
But as we already have the data imported with readLines, we could just parse it:
x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)
# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))
# check output
dim(d2)
# [1] 102 1101
d2[100:102, 1101]
# V1101
# 1: <NA>
# 2: <NA>
# 3: 1101
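Note that the `<NA>` entries above indicate character columns, because strsplit returns strings. A minimal sketch of one way to get integer columns instead, assuming every field in the file is an integer. The vector `x` here is a small hypothetical stand-in for the result of readLines, and `m` and `d_int` are illustrative names:

```r
# Hypothetical stand-in for readLines("notworking1.dat"): ragged lines of
# integers, each ending with the trailing space that cat() leaves behind.
x <- vapply(1:5, function(i) paste0(paste(1:i, collapse = " "), " "), character(1))

xSplit <- strsplit(x, " ", fixed = TRUE)
n <- max(lengths(xSplit))

# Convert each row to integer and pad it to length n;
# indexing past the end of a vector yields NA.
m <- t(vapply(xSplit, function(v) as.integer(v)[seq_len(n)], integer(n)))

# as.data.frame(m) (or data.table::as.data.table(m)) then has integer columns.
d_int <- as.data.frame(m)
str(d_int)
```

Using vapply with an integer(n) template also makes the call fail loudly if a row ever comes back with the wrong length or type, which sapply would silently tolerate.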
This is a known issue (data.table GitHub issue 5119). It was not implemented at the time of writing, but the suggestion there is that fill should also accept an integer. The solution would then be something like:
d <- fread(input = "notworking1.dat", fill = 1101)
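If the installed data.table version does not accept an integer for fill, another option that avoids the temporary file is base R. This is a sketch under the assumption that the fields are whitespace-separated: count.fields finds the widest row, and read.table with explicit col.names sizes the result from their length rather than from the first few lines. The file below is a small hypothetical stand-in for notworking1.dat, and `max_cols` and `d3` are illustrative names:

```r
# Build a small ragged file in a temporary location
# (hypothetical stand-in for notworking1.dat: row i holds the integers 1..i).
tmp <- tempfile(fileext = ".dat")
for (i in 1:8) {
  cat(1:i, "\n", file = tmp, append = TRUE)
}

# Count whitespace-separated fields on each line, then take the widest row.
n_fields <- count.fields(tmp)
max_cols <- max(n_fields)

# With explicit col.names, read.table takes the column count from their length;
# fill = TRUE pads the shorter rows with NA.
d3 <- read.table(tmp, fill = TRUE, col.names = paste0("V", seq_len(max_cols)))
dim(d3)
```

The result is a data.frame; data.table::setDT(d3) would convert it to a data.table by reference if that is what the rest of the code expects.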