Home > OS >  Filtering numbers above a certain threshold for multiple columns in R
Filtering numbers above a certain threshold for multiple columns in R

Time:04-11

I have a data set with with 600 variables and I would like to only keep the rows where the values in multiple columns are above a certain threshold, in my case >= 5,000,000.

For a single column I would filter like this:

df <- df %>%
  filter(variable.2019.03 >= 5000000)

I could do it for multiple columns like this:

df <- df %>%
  filter(variable.2019.03 >= 5000000 & variable.2019.04 >= 5000000 &
         variable.2019.05 >= 5000000)

However, I want to do this for 30 columns so I was hoping there is a more elegant way. My understanding is that I should be able to do this using the filter() function in combination with if_all(), but I was only able to find example where these functions are used to filter character columns nor numeric ones. Similar to the examples here, I have tried:

  filter(if_all(>= 5000000), variable.2019.03:variable.2022.03)

which only yielded this error message:

Error: unexpected '>=' in:
"df <- df %>%
  filter(if_all(>="

Any suggestions on achieving the desired outcome?

CodePudding user response:

We need to specify the columns (.cols) within the if_all. By default it is everything(), so we change to match only the columns that starts with (^) 'variable' followed by a dot (.), four digits (\\d{4})..., within the matches (that uses regular expression to match the column names)

library(dplyr)
df %>%
   filter(if_all(matches('^variable\\.\\d{4}\\.d{2}$'), ~ .x >= 5000000))

If the columns are adjacent, we can use :

df %>%
    filter(if_all(variable.2019.03:variable.2022.03, ~ .x >= 5000000))
  • Related