Find groups that have the same values in another two columns-CodePudding

I'd like to find the rows that group A and group B have the same values in the feature column and the value column.

My data:

  group feature value
1     A      f1     1
2     B      f1     1
3     A      f2     1
4     B      f2     0
5     A      f3     0
6     B      f3     1

df = data.frame(group = rep(c("A", "B"), 3), 
                feature = c("f1", "f1", "f2", "f2", "f3", "f3"),
                value = c(1,1,1,0,0,1))

Desired outputs:

Shared:

  group feature value
1     A      f1     1
2     B      f1     1

Group-specific:

  group feature value
1     A      f2     1
2     B      f2     0
3     A      f3     0
4     B      f3     1

I've got the results by doing it manually:

split the dataframe by the group column --> get two dataframe, one only has group A and another only has group B.
create a new column feature_value using paste(df$feature, df$value, sep = "_").
use intersect and setdiff to find the shared and group-specific feature_value.
filter the orginal df using the extracted feature_value.

This is laborious and I hope there is a better way.

CodePudding user response：

Solution 1: Grouped Filter

Group by feature, then filter:

library(dplyr)

df %>% 
  group_by(feature) %>%
  filter(n_distinct(value) == 1) %>%
  ungroup()

# A tibble: 2 × 3
  group feature value
  <chr> <chr>   <dbl>
1 A     f1          1
2 B     f1          1

Same code, but with != instead of ==:

# A tibble: 4 × 3
  feature group value
  <chr>   <chr> <dbl>
1 f2      A         1
2 f2      B         0
3 f3      A         0
4 f3      B         1

Solution 2: Pivoting

A bit more cumbersome, but may be more flexible e.g. if you want to make comparisons among >2 features: pivot wider, then filter, then pivot back to long:

library(dplyr)
library(tidyr)

df_wide <- df %>% 
  pivot_wider(names_from = group)
  
shared <- df_wide %>%
  filter(A == B) %>%
  pivot_longer(A:B, names_to = "group")

group_specific <- df_wide %>%
  filter(A != B) %>%
  pivot_longer(A:B, names_to = "group")

CodePudding user response：

I think this is what are you looking for?

For the first desired output:

library(data.table)
setDT(df)

df[, .SD[.I[.N > 1]], by = .(feature, value)]

   feature value group
1:      f1     1     A
2:      f1     1     B

And, for the second desired output, we could do:

df[, .(group = df$group[.I[.N == 1]]), by = .(feature, value)]

   feature value group
1:      f2     1     A
2:      f2     0     B
3:      f3     0     A
4:      f3     1     B