Home > database >  How to filter entries from a list of dataframes based on multiple criteria in R
How to filter entries from a list of dataframes based on multiple criteria in R

Time:08-24

I would like to filter 'similar' entries from a list in R. The input list follows:

d1=data.frame(y1 = c('Mike'),
              y2 = c('Tsoutsa'),
              y3 = c('Apple'),
              time = 50)

d2=data.frame(y1 = c('Mike'),
              y2 = c('Tsoutsa'),
              y3 = c('Orange'),
              time = 160)

d3=data.frame(y1 = c('Mike'),
              y2 = c('Tsoutsa'),
              y3 = c('Lemon'),
              time = 100)

d4=data.frame(y1 = c('Anna'),
              y2 = c('Pournova'),
              y3 = c('Nikolief'),
              time = 30)

d5=data.frame(y1 = c('Anna'),
              y2 = c('Pournova'),
              y3 = c('Leeds'),
              y4 = c('York'),
              time = 80 )

d6=data.frame(y1 = c('Loulis'),
              y2 = c('City'),
              time = 200 )

d7=data.frame(y1 = c('Ann'),
              y2 = c('Klitor'),
              time = 200 )

input = list(d1, d2, d3, d4, d5, d6, d7)

In the output list I would like to keep only the dataframes that have unique y1 and y2 values. If there are dataframes with same y1 and y2 then keep the one with the highest time. Hence, the output list should be: output = list(d2, d5, d6, d7).

What is the most efficient way of doing this? Much appreciated!

CodePudding user response:

Using dplyr, you can bind the rows (dplyr::bind_rows, with .id argument to create an identifier), and for y1 and y2 group, you can select the one with max time, then extract the set of ids. Use that to index the input list

library(dplyr)

retain = as.numeric(bind_rows(input,.id = "d") %>% 
  group_by(y1,y2) %>% 
  slice_max(time) %>%
  pull(d))

output <- input[sort(retain)]

An approach using data.table is as follows:

library(data.table)

input[
  rbindlist(lapply(input, setDT), idcol="d", fill=T)[, .SD[time==max(time)], .(y1,y2)]$d
]

Output:

[[1]]
    y1      y2     y3 time
1 Mike Tsoutsa Orange  160

[[2]]
    y1       y2    y3   y4 time
1 Anna Pournova Leeds York   80

[[3]]
      y1   y2 time
1 Loulis City  200

[[4]]
   y1     y2 time
1 Ann Klitor  200

Update 8/23/22

An alternative data.table approach:

rbindlist(
  lapply(input, \(i) setDT(i)[, .(y1, y2, time)]), idcol = "d"
)[order(y1,y2,-time),id:=1:.N,.(y1,y2)][id==1]

Output:

       d     y1       y2  time    id
   <int> <char>   <char> <num> <int>
1:     2   Mike  Tsoutsa   160     1
2:     5   Anna Pournova    80     1
3:     6 Loulis     City   200     1
4:     7    Ann   Klitor   200     1
  • Related