Read in rows from a CSV using data.table's fread BY GROUP-CodePudding

I have a very large CSV file containing millions and millions of rows, it is a combined dataset from multiple sources. I have a character column/variable that indicates the source of the row/observation, and I'd like to read in a subset of rows from each group using data.table's fread.

Right now I'm trying to read in 20,000 rows to test out my code but the first few million rows are all from one source so I can't get rows from multiple sources. Is there a way to say read in 20,000 rows from each SOURCE level? So basically reading in 20,000 rows from each BY GROUP (source)?

CodePudding user response：

One potential solution is to use the vroom package to read in only the "group" column, i.e.

library(tidyverse)
library(vroom)
df <- vroom("file.csv", col_select = group)

Then sample the rownumbers, grouped by group, in 'blocks':

df2 <- df %>%
  mutate(rownumber = row_number()) %>%
  group_by(group) %>% 
  slice_head(n = 20000) %>%
  ungroup() %>%
  select(rownumber)

Then use this approach to select the rows in each block (one call to fread per block) and rbind them together, i.e.

library(data.table)
s <- df2$rownumber
v <- (1:1e5 %in% s)
idx  <- c(0, cumsum(seq$lengths))[which(seq$values)]   1
indx <- data.frame(start=idx, length=seq$length[which(seq$values)])
result <- do.call(rbind,apply(indx,1, function(x) return(fread("test.csv",nrows=x[2],skip=x[1]))))

I don't have any huge files to test this, but if this approach doesn't work for you I can help troubleshoot. It might also be worth looking at an external program (e.g. AWK) as a workaround if your dataset is too large.

CodePudding user response：

I wonder if doing something outside of R might be better. Using awk for instance, using this data in awkgroup.csv:

"something","group"
1,"A"
2,"A"
3,"A"
4,"A"
5,"A"
6,"B"
7,"B"
8,"B"
9,"B"
10,"B"
11,"C"
12,"C"
13,"C"
14,"C"
15,"C"
16,"D"
17,"D"
18,"D"
19,"D"
20,"D"

We can do

$ awk -F, '$2==inp{line  ;if(line<3)print($0)}; $2!=inp{inp=$2;line=0;print($0)};' awkgroup.csv > newdata.csv
$ cat newdata.csv
"something","group"
1,"A"
2,"A"
3,"A"
6,"B"
7,"B"
8,"B"
11,"C"
12,"C"
13,"C"
16,"D"
17,"D"
18,"D"

Basic walk-through (though I do not consider myself an awk wizard by any account):

$2==inp (and similarly $2!=inp) tests if the second column (our grouping variable) has changed since the previous row. The inp is initially uninitialized so will default to the empty string.

NB: this assumes the data is ordered by-group.
line ;if(line<3)print($0) is the majority of work, where it tests if line (our method of tracking lines within the current group) is less than 3 (we use a 0-based line here) and prints if so. This gives us the top 3 lines per group.
inp=$2;line=0;print($0) is similar, but is run on the first line within a group; it resets the line counter, sets what we think the current group is by assigning to inp, and always prints (since this is the first line within a group).

I couldn't get this to work well in fread(cmd="..."), likely because it (for some reason) uses shell in lieu of the admittedly-also-broken (but not here) system. One can use system and control where the output goes, then read that normally, perhaps

system2("awk", c("-F, '$2==inp{line  ;if(line<3)print($0);};$2!=inp{inp=$2;line=0;print($0)};'",
                 "awkgroup.csv"),
        stdout="awkout.csv" )
fread("awkout.csv")
#     something  group
#         <int> <char>
#  1:         1      A
#  2:         2      A
#  3:         3      A
#  4:         6      B
#  5:         7      B
#  6:         8      B
#  7:        11      C
#  8:        12      C
#  9:        13      C
# 10:        16      D
# 11:        17      D
# 12:        18      D

FYI, system2 is not really any better than system: it just concatenates the quoted (good!) command= with all unquoted args (bad!):

  command <- paste(c(shQuote(command), env, args), collapse = " ")

which is why I'm able to cheat a little and combine all of awk's args= into a vector.

From here, you need to control two things about this:

change $2 (in three places) to the column number of your grouping variable;
change < 3 to be whatever limit you want (recall that it is 0-based, so < 3 gives you 3 entries, not 2).