I have a data set with with 600 variables and I would like to only keep the rows where the values in multiple columns are above a certain threshold, in my case >= 5,000,000.
For a single column I would filter like this:
df <- df %>%
filter(variable.2019.03 >= 5000000)
I could do it for multiple columns like this:
df <- df %>%
filter(variable.2019.03 >= 5000000 & variable.2019.04 >= 5000000 &
variable.2019.05 >= 5000000)
However, I want to do this for 30 columns so I was hoping there is a more elegant way. My understanding is that I should be able to do this using the filter() function in combination with if_all(), but I was only able to find example where these functions are used to filter character columns nor numeric ones. Similar to the examples here, I have tried:
filter(if_all(>= 5000000), variable.2019.03:variable.2022.03)
which only yielded this error message:
Error: unexpected '>=' in:
"df <- df %>%
filter(if_all(>="
Any suggestions on achieving the desired outcome?
CodePudding user response:
We need to specify the columns (.cols
) within the if_all
. By default it is everything()
, so we change to match only the columns that starts with (^
) 'variable' followed by a dot (.
), four digits (\\d{4}
)..., within the matches
(that uses regular expression to match the column names)
library(dplyr)
df %>%
filter(if_all(matches('^variable\\.\\d{4}\\.d{2}$'), ~ .x >= 5000000))
If the columns are adjacent, we can use :
df %>%
filter(if_all(variable.2019.03:variable.2022.03, ~ .x >= 5000000))