I would like to filter 'similar' entries from a list in R. The input list follows:
d1=data.frame(y1 = c('Mike'),
y2 = c('Tsoutsa'),
y3 = c('Apple'),
time = 50)
d2=data.frame(y1 = c('Mike'),
y2 = c('Tsoutsa'),
y3 = c('Orange'),
time = 160)
d3=data.frame(y1 = c('Mike'),
y2 = c('Tsoutsa'),
y3 = c('Lemon'),
time = 100)
d4=data.frame(y1 = c('Anna'),
y2 = c('Pournova'),
y3 = c('Nikolief'),
time = 30)
d5=data.frame(y1 = c('Anna'),
y2 = c('Pournova'),
y3 = c('Leeds'),
y4 = c('York'),
time = 80 )
d6=data.frame(y1 = c('Loulis'),
y2 = c('City'),
time = 200 )
d7=data.frame(y1 = c('Ann'),
y2 = c('Klitor'),
time = 200 )
input = list(d1, d2, d3, d4, d5, d6, d7)
In the output list I would like to keep only the dataframes that have unique y1
and y2
values. If there are dataframes with same y1
and y2
then keep the one with the highest time
.
Hence, the output list should be: output = list(d2, d5, d6, d7)
.
What is the most efficient way of doing this? Much appreciated!
CodePudding user response:
Using dplyr
, you can bind the rows (dplyr::bind_rows, with .id
argument to create an identifier), and for y1
and y2
group, you can select the one with max time, then extract the set of ids. Use that to index the input list
library(dplyr)
retain = as.numeric(bind_rows(input,.id = "d") %>%
group_by(y1,y2) %>%
slice_max(time) %>%
pull(d))
output <- input[sort(retain)]
An approach using data.table
is as follows:
library(data.table)
input[
rbindlist(lapply(input, setDT), idcol="d", fill=T)[, .SD[time==max(time)], .(y1,y2)]$d
]
Output:
[[1]]
y1 y2 y3 time
1 Mike Tsoutsa Orange 160
[[2]]
y1 y2 y3 y4 time
1 Anna Pournova Leeds York 80
[[3]]
y1 y2 time
1 Loulis City 200
[[4]]
y1 y2 time
1 Ann Klitor 200
Update 8/23/22
An alternative data.table approach:
rbindlist(
lapply(input, \(i) setDT(i)[, .(y1, y2, time)]), idcol = "d"
)[order(y1,y2,-time),id:=1:.N,.(y1,y2)][id==1]
Output:
d y1 y2 time id
<int> <char> <char> <num> <int>
1: 2 Mike Tsoutsa 160 1
2: 5 Anna Pournova 80 1
3: 6 Loulis City 200 1
4: 7 Ann Klitor 200 1