Home > Mobile >  R Data.Table Keep All Rows of First Occurence per Group
R Data.Table Keep All Rows of First Occurence per Group

Time:10-29

I have a data.table with groups and values. I want to keep all entries that are within a group's first occurence (from top of the table to the bottom).

Example:

set.seed(666)
group = c(1,1,1,2,2,3,3,3,1,1,4,4,4,1,1,2)
value = runif(16)
DT = data.table(group,value)

 > DT
 group      value
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     3 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
 9:     1 0.01331584
10:     1 0.25994613
11:     4 0.77589308
12:     4 0.01637905
13:     4 0.09574478
14:     1 0.14216354
15:     1 0.21112624
16:     2 0.81125644

What I want to achieve (row 9, 10, 14, 15 and 16 being removed as group 1 and 2 appeared before already):

 > DT
 group      value
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     3 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
11:     4 0.77589308
12:     4 0.01637905
13:     4 0.09574478

I've figured that DT[,.SD[1], by = "group", .SDcols = "value"] gives me the first entry per group but I want all entries until the groups change (e.g. the first three entries of group 1 in this particular case).

I thought about using something like DT[,.I, by = group] which gives me the row indices per entry sorted by groups but I have absolutely no idea how to elegantly identify such "breaks" within a data.table expression.

Edit: I find the maximum row number per group like this but don't know how to continue from here:

maxRow = setnames(DT[, na.omit(which(diff(.I) > 1)[1]), by = "group"], "V1", "maxRow")

CodePudding user response:

One possible solution:

DT[,.(group,value,rleid=rleid(group))][,.SD[rleid==min(rleid),.(value)],by=group]
    group      value
    <num>      <num>
 1:     1 0.77436849
 2:     1 0.19722419
 3:     1 0.97801384
 4:     2 0.20132735
 5:     2 0.36124443
 6:     2 0.74261194
 7:     3 0.97872844
 8:     3 0.49811371
 9:     3 0.01331584
10:     4 0.01637905
11:     4 0.09574478
12:     4 0.14216354
  • Related