How to find elements with not enough observations in a list

Time:09-14

Say I have the following list, where each element is a data.frame of a different size:

df1 <- data.frame(matrix(rnorm(12346), ncol = 2))
df2 <- data.frame(matrix(rnorm(14330), ncol = 2))
df3 <- data.frame(matrix(rnorm(2422), ncol = 2))

l <- list(df1, df2, df3)

In my example each data.frame represents a year of observations, and df3 clearly contains far fewer observations than the other two.

My question is: what is the best approach to detect the elements of the list l whose number of rows is much smaller than the others, and then remove them from the list?

So far I've tried using the median as a cutoff, but since this will always remove roughly half of the elements in l, I'm not sure it is the best solution for future use:

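# Row count of every data.frame in the list (base R only, no extra packages)
n_rows <- vapply(input, nrow, FUN.VALUE = integer(1))

# Use the median row count as the cutoff
cutoff <- median(n_rows)

# TRUE for the elements that have at least the median number of rows
idx <- n_rows >= cutoff

input[idx]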

where input is a list like l above.
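
To illustrate the concern with the median, here is a minimal sketch using made-up row counts (the numbers are hypothetical and not taken from my data). Even when every year has almost the same number of rows, a median cutoff still drops half of them:

```r
# Hypothetical row counts for four years of similar size
n_rows <- c(100L, 101L, 102L, 103L)

# The median of an even number of values is the mean of the two middle ones
cutoff <- median(n_rows)

# TRUE for the years that would be kept under the median rule
keep <- n_rows >= cutoff

# Number of years dropped, even though none of them is unusually short
sum(!keep)
```

So the rule removes elements because of their rank, not because they are actually short.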

NOTE: As this is my first question on SO, please feel free to edit the question if it does not live up to the standards of this community, or give feedback on asking better questions. Thanks in advance.

EDIT: The question is not so much how to use the median to remove elements of the list, but rather whether the median is the right method for removing those data.frames that have far fewer observations than the others.

CodePudding user response:

Does this work?

l[sapply(l, function(x) nrow(x) >= median(unlist(lapply(l, nrow))))]

CodePudding user response:

purrr::keep is the way to go when filtering lists with conditions.

library(purrr)

keep(l, ~ nrow(.x) >= median(map_dbl(l, nrow)))
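
If the median itself turns out to be too strict a cutoff, one possible alternative is a cutoff relative to the median, so that only elements far below the typical size are dropped. The sketch below is an assumption on my part, not something from the question: `keep_large` and `min_frac` are hypothetical names, and the default of 0.5 is arbitrary and would need tuning for real data. It uses base R only.

```r
# Keep the list elements whose row count is at least `min_frac` times the
# median row count. `keep_large` and `min_frac` are hypothetical names, and
# 0.5 is an arbitrary default chosen for illustration.
keep_large <- function(lst, min_frac = 0.5) {
  n_rows <- vapply(lst, nrow, FUN.VALUE = integer(1))
  lst[n_rows >= min_frac * median(n_rows)]
}

# Data from the question
df1 <- data.frame(matrix(rnorm(12346), ncol = 2))
df2 <- data.frame(matrix(rnorm(14330), ncol = 2))
df3 <- data.frame(matrix(rnorm(2422), ncol = 2))
l <- list(df1, df2, df3)

# Row counts of the elements that survive the relative cutoff
kept <- keep_large(l)
vapply(kept, nrow, FUN.VALUE = integer(1))
```

With the question's data this keeps df1 and df2 and drops df3, and a list of similarly sized years would be kept in full. Whether a fraction of the median is the right rule depends on how much the yearly sizes normally vary.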