I have a data.table with groups and values. I want to keep all entries that are within a group's first occurence (from top of the table to the bottom).
Example:
set.seed(666)
group = c(1,1,1,2,2,3,3,3,1,1,4,4,4,1,1,2)
value = runif(16)
DT = data.table(group,value)
> DT
group value
1: 1 0.77436849
2: 1 0.19722419
3: 1 0.97801384
4: 2 0.20132735
5: 2 0.36124443
6: 3 0.74261194
7: 3 0.97872844
8: 3 0.49811371
9: 1 0.01331584
10: 1 0.25994613
11: 4 0.77589308
12: 4 0.01637905
13: 4 0.09574478
14: 1 0.14216354
15: 1 0.21112624
16: 2 0.81125644
What I want to achieve (row 9, 10, 14, 15 and 16 being removed as group 1 and 2 appeared before already):
> DT
group value
1: 1 0.77436849
2: 1 0.19722419
3: 1 0.97801384
4: 2 0.20132735
5: 2 0.36124443
6: 3 0.74261194
7: 3 0.97872844
8: 3 0.49811371
11: 4 0.77589308
12: 4 0.01637905
13: 4 0.09574478
I've figured that DT[,.SD[1], by = "group", .SDcols = "value"]
gives me the first entry per group but I want all entries until the groups change (e.g. the first three entries of group 1 in this particular case).
I thought about using something like DT[,.I, by = group]
which gives me the row indices per entry sorted by groups but I have absolutely no idea how to elegantly identify such "breaks" within a data.table expression.
Edit: I find the maximum row number per group like this but don't know how to continue from here:
maxRow = setnames(DT[, na.omit(which(diff(.I) > 1)[1]), by = "group"], "V1", "maxRow")
CodePudding user response:
One possible solution:
DT[,.(group,value,rleid=rleid(group))][,.SD[rleid==min(rleid),.(value)],by=group]
group value
<num> <num>
1: 1 0.77436849
2: 1 0.19722419
3: 1 0.97801384
4: 2 0.20132735
5: 2 0.36124443
6: 2 0.74261194
7: 3 0.97872844
8: 3 0.49811371
9: 3 0.01331584
10: 4 0.01637905
11: 4 0.09574478
12: 4 0.14216354