Home > other >  Filter on combinations of columns in dataframe (R)
Filter on combinations of columns in dataframe (R)

Time:10-08

I am trying to filter a dataframe in R using dplyr where I want to check on a combination of 2 columns. Lets say I have the following dataframe;

df <- data.frame(val1 = c("a","b","c","s","s","s"),
             val2 = c(10, 5, 2, 2, 10, 5))

resulting in the following df;

  val1 val2
1    a   10
2    b    5
3    c    2
4    s    2
5    s   10
6    s    5

I want to filter out all rows where val1 == 's' EXCEPT when val2 on the same row is 5 or 10. So the result should look like this;

  val1 val2
1    a   10
2    b    5
3    c    2
4    s   10
5    s    5

I want to use the filter function from dplyr because I am using it with other filters in a query on a oracle database. I tried the following lines, but they do not work for they either filter out all rows where val1 == "s" or all rows where val2 == 5 or val2 == 10 (so in this example, including the rows where val1 == "a" or val1 == "b";

filter(val1 == "s" & val2 == 10 | val2 == 5)
filter(val1 == "s" && val2 == 10 | val2 == 5)
filter(val1 == "s" & (val2 == 10 | val2 == 5))
filter(val1 == "s" && (val2 == 10 | val2 == 5))

I tried looking up similar questions but they do not seem to answer my problem. I make it work by combining val1 and val2 as val3, and then I filter out all rows where val3 == "s5" or val3=="s10", but as I said, I intend to use the filter in a query to a database. Therefore, I would prefer to filter as much as possible in the query itself because my actual dataframe consists of hundreds of thousands of rows and 50 columns.

I do not think it would be hard, but I just can't seem to get it. Any help will be greatly appreciated!

CodePudding user response:

This works:

df %>% filter(!(val1 == 's' & !val2 %in% c(10, 5)))

Or equivalently:

df %>% filter((val1 != 's' | val2 %in% c(10, 5)))

output:

  val1 val2
1    a   10
2    b    5
3    c    2
4    s   10
5    s    5

CodePudding user response:

I want to filter out all rows where val1 == 's' EXCEPT when val2 on the same row is 5 or 10.

filter's conditions are for the rows to keep, so I would re-frame this as "I want to keep rows where val1 != 's' or where val2 is 5 or 10":

And generally it's nicer to use %in% as in x %in% c(1, 2, 3) rather than x == 1 | x == 2 | x == 3.

df %>%
  filter(val1 != 's' | val2 %in% c(5, 10))
#   val1 val2
# 1    a   10
# 2    b    5
# 3    c    2
# 4    s   10
# 5    s    5
  • Related