Home > Software engineering >  How to check whether all values in grouped columns are the same?
How to check whether all values in grouped columns are the same?

Time:02-25

How do I check whether all the values in grouped columns are the same?

For example, I have the following df:

   id category yes
1   1       in   1
2   1       in   1
3   1       in   1
4   1       in   1
5   1       in   1
6   1      out   1
7   1      out   1
8   1      out   1
9   2       in   1
10  2       in   1
11  2      out   0
12  2      out   1
13  2      out   1
14  3       in   1
15  3       in   1
16  3       in   0
17  3      out   1
18  3      out   1
19  4       in   1
20  4       in   1
21  4       in   1
22  4      out   1
23  4      out   0

I want to do something like this:

df <- df %>%
  group_by(id, category) %>%
  mutate(
    out = ifelse(# id, category, and yes have the same values in each row within the group)
  )

So the expected output will look like this:

   id category yes same
1   1       in   1    1
2   1       in   1    1
3   1       in   1    1
4   1       in   1    1
5   1       in   1    1
6   1      out   1    1
7   1      out   1    1
8   1      out   1    1
9   2       in   1    1
10  2       in   1    1
11  2      out   0    0
12  2      out   1    0
13  2      out   1    0
14  3       in   1    0
15  3       in   1    0
16  3       in   0    0
17  3      out   1    1
18  3      out   1    1
19  4       in   1    1
20  4       in   1    1
21  4       in   1    1
22  4      out   1    0
23  4      out   0    0

Rows 11-13 have the same "id" and "category" but the "yes" column has different values. Thus, the "same" column should be marked 0 (because they're not the same). Same with rows 14-16 and rows 22-23.

Here's the reproducible code for the df:

structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), category = c("in", 
"in", "in", "in", "in", "out", "out", "out", "in", "in", "out", 
"out", "out", "in", "in", "in", "out", "out", "in", "in", "in", 
"out", "out"), yes = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -23L))

Any guidance would be appreciated!

CodePudding user response:

We may use n_distinct to check for frequency of unique elements in the group, convert to logical (== 1) and then to binary with as.integer or

library(dplyr)
df %>%
  group_by(id, category) %>% 
  mutate(same =  (n_distinct(yes) == 1)) %>% 
  ungroup

Or using data.table

library(data.table)
setDT(df)[, same :=  (uniqueN(yes) == 1), by = .(id, category)]

CodePudding user response:

One option if your data is a data.table.

I don't think this is a good solution in situations like this example where you only have one column to check. In that case you can use uniqueN like @akrun shows. If I'm remembering correctly, when you have many rows and many columns in your cols argument to the below function, this is faster because to use uniqueN in that case you'd have to first create a data.table from the cols you want to check, once per each group (uniqueN input must be vector or data.table). But, I don't have any benchmark so maybe I'm misremembering.

library(data.table)
setDT(df)

check_single_value <- function(df, col_uq, by, show_groups = FALSE) {
  n_uq <- unique(df[, c(col_uq, by), with = FALSE])[, .N, by = by]
  if (show_groups)
    n_uq[N > 1]
  else 
    n_uq[, !any(N > 1)]
}

check_single_value(df, 'yes', by = c('id', 'category'))
#> [1] FALSE

check_single_value(df, 'yes', by = c('id', 'category'), show_groups = T)
#>       id category     N
#>    <int>   <char> <int>
#> 1:     2      out     2
#> 2:     3       in     2
#> 3:     4      out     2

Created on 2022-02-24 by the reprex package (v2.0.1)

  • Related