Home > front end >  Comparing multiple columns to equality
Comparing multiple columns to equality

Time:05-11

This may be a very elementary question using dplyr and tidyverse tools, but I couldn't find a good way to do it.

Let's suppose I have a data frame in a wide format, and I want to select rows so that a subset of columns has all the same value. Naively, I can do the following:

> df <- tribble(
    ~name, ~id,  ~cost, ~value1 , ~value2, ~value3,
    "a",     1,     10,       1,        1,       1,
    "a",     2,     20,       1,        2,       1,
    "b",     3,     50,       1,        1,       3,
    "b",     4,     45,       1,        1,       1,
    "b",     5,     70,       2,        2,       2
)


> df %>% select(
    value1 == value2 &
    value1 == value3 &
    value2 == value
)

# A tibble: 3 × 6
  name     id  cost value1 value2 value3
  <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 a         1    10      1      1      1
2 b         4    45      1      1      1
3 b         5    70      2      2      2

Now, let's suppose the number of the columns to be compared is very large (> 10). All columns start with value, so that we may have value_something, value_otherthing, value_morething, i.e., not necessarily numeric as in this example. However, if the number of columns is n, naively I have to create n * (n - 1) / 2 comparisons, which is clearly unmanageable.

Is there something like

df %>% filter(all_same(starts_with("value")))

where all_same()compares all selected columns by starts_with() (or any other selector)?

rowwise() and across() didn't help me too much either.

CodePudding user response:

We may use if_all to loop over the columns from 'value2' to 'value3', check if the column values are equal with value1, if_all returns TRUE only for a row where all the column comparisons are TRUE

library(dplyr)
df %>%
    filter(if_all(value2:value3, ~ value1 == .x))

-output

# A tibble: 3 × 6
  name     id  cost value1 value2 value3
  <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 a         1    10      1      1      1
2 b         4    45      1      1      1
3 b         5    70      2      2      2

Or if we want to use starts_with

df %>%
    filter(if_all(starts_with('value'), ~ value1 == .x))

CodePudding user response:

Here's a possible base R option, where we can count the number of unique values to see if there is only 1 for each row (and just for the "value" columns).

df[apply(df[, -c(1:3)], 1, function(x) length(unique(x)) == 1), ]

Output

  name     id  cost value1 value2 value3
  <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 a         1    10      1      1      1
2 b         4    45      1      1      1
3 b         5    70      2      2      2
  • Related