Simple fread operation with fill=TRUE fails

Time:04-30

The following code generates data files in which each row has a different number of columns. The option fill=TRUE appears to work only until a certain character limit is reached. For instance, compare the working1.dat example (100 long rows) with the working2.dat example (101 short rows): both are read as expected, whereas notworking1.dat (102 long rows) is not. How can I read the entirety of notworking1.dat with fill=TRUE enabled, and not just the first 100 rows?

for (i in seq(1000,1099,by=1)) 
    cat(file="working1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working1.dat", fill=TRUE)

for (i in seq(1000,1101,by=1)) 
    cat(file="notworking1.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "notworking1.dat", fill=TRUE)

for (i in seq(1,101,by=1)) 
    cat(file="working2.dat", c(1:i, "\n"), append = TRUE)
df <- fread(input = "working2.dat", fill=TRUE)

The following attempt, which supplies col.names explicitly, also fails:

df <- fread(input = "notworking1.dat", fill=TRUE, col.names=paste0("V", seq_len(1101)))

Warning message received:

Warning message:
In data.table::fread(input = "notworking1.dat", fill = TRUE) :
  Stopped early on line 101. Expected 1099 fields but found 1100. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1 2 3 4 ...
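To see where the rows become wider than anything in the early part of the file, it can help to count the fields on each line. The sketch below uses base R's count.fields on a freshly generated temporary copy of the file (the path variable is only for illustration). It assumes, as the warning suggests, that the column count was settled from roughly the first 100 rows:

```r
# Recreate a file shaped like notworking1.dat in a temporary location
# (path is a hypothetical name used only in this sketch).
path <- tempfile(fileext = ".dat")
for (i in seq(1000, 1101, by = 1))
    cat(file = path, c(1:i, "\n"), append = TRUE)

# count.fields() is base R: number of whitespace-separated fields on each line.
n_fields <- count.fields(path, sep = "")

# Widest row among the first 100 rows versus widest row in the whole file.
widest_first_100 <- max(n_fields[1:100])
widest_overall   <- max(n_fields)

# Rows that are wider than anything seen in the first 100 rows.
too_wide <- which(n_fields > widest_first_100)

print(c(widest_first_100 = widest_first_100, widest_overall = widest_overall))
print(too_wide)
```

If too_wide is non-empty, the widest rows sit beyond the part of the file that the early rows describe, which matches the line number reported in the warning.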

CodePudding user response:

We could find the maximum number of columns, write a header line with that many column names, and then fread the result:

library(data.table)

x <- readLines("notworking1.dat")
myHeader <- paste(paste0("V", seq(max(lengths(strsplit(x, " ", fixed = TRUE))))), collapse = " ")

# write with headers
write(myHeader, "tmp_file.txt")
write(x, "tmp_file.txt", append = TRUE)
# read as usual with fill
d1 <- fread("tmp_file.txt", fill = TRUE)

# check output
dim(d1)
# [1]  102 1101
d1[100:102, 1101]
#    V1101
# 1:    NA
# 2:    NA
# 3:  1101

But as we already have the data imported with readLines, we could just parse it:

x <- readLines("notworking1.dat")
xSplit <- strsplit(x, " ", fixed = TRUE)

# rowbind unequal length list, and convert to data.table
d2 <- data.table(t(sapply(xSplit, '[', seq(max(lengths(xSplit))))))

# check output
dim(d2)
# [1]  102 1101
d2[100:102, 1101]
#    V1101
# 1:  <NA>
# 2:  <NA>
# 3:  1101
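The same idea, finding the widest row first and then telling the reader how many columns to expect, can also be sketched without data.table. This is a base R alternative, not the approach from the answer above: it relies on read.table taking the number of columns from col.names when that is supplied. The temporary path and the names n_max and d3 are assumptions made for this example:

```r
# Recreate a file shaped like notworking1.dat in a temporary location
# (path is a hypothetical name used only in this sketch).
path <- tempfile(fileext = ".dat")
for (i in seq(1000, 1101, by = 1))
    cat(file = path, c(1:i, "\n"), append = TRUE)

# Widest row in the file, counted over whitespace-separated fields.
n_max <- max(count.fields(path, sep = ""))

# Supplying n_max column names fixes the column count up front;
# fill = TRUE pads the shorter rows with NA.
d3 <- read.table(path, sep = "", fill = TRUE, header = FALSE,
                 col.names = paste0("V", seq_len(n_max)))

# check output
dim(d3)
d3[100:102, n_max]
```

The result is a data.frame; if a data.table is needed, data.table::as.data.table(d3) converts it. For large files fread is typically much faster, so this is mainly useful as a fallback or a cross-check.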

This is a known issue, tracked as GitHub issue 5119. At the time of writing it is not implemented, but the suggestion there is that fill should also accept an integer as input. The solution would then look something like:

d <- fread(input = "notworking1.dat", fill = 1101)