Finding the grouping variable for which the unique values of two columns are not unique-CodePudding

In DATA below, I was wondering how to find the unique study_id in which variable var1 and var2 don't form a unique combination?

Obviously, study_id that consist of a single row (e.g.,Bagheri) always form unique and are to be ignored.

But how could we find unique study_id with multiple rows that don't form a unique combination?

My expected answer is:

> [1]"Alavinia" (because it has multiple rows but keeps repeating the combination `var1 == ext` & `var2 == write`)

> [2] "Busch" (because it has multiple rows but repeats the combination `var1 == ext` & `var2 == speak`)

m="
study_id  var1  var2
Alavinia  ext   write
Alavinia  ext   write
Alavinia  ext   write
Bagheri   ext   read
Bartz     ext   speak
Bartz     ext   listen
Bartz     ext   write
Bartz     ext   read
Busch     ext   gra
Busch     ext   read
Busch     ext   others
Busch     ext   speak
Busch     ext   speak
"

DATA <- read.table(text = m, h=T)

CodePudding user response：

Base R method:

subset(data.frame(table(DATA)), Freq > 1)[,'study_id']
[1] Busch    Alavinia

or even:

sub(" .*", '', names(which(table(do.call(paste, DATA))>1)))

[1] "Alavinia" "Busch"

CodePudding user response：

A possible solution, based on dplyr:

library(dplyr)

DATA %>% 
  add_count(study_id, var1, var2) %>% 
  distinct %>% 
  filter(n > 1) %>% 
  pull(study_id)

#> [1] "Alavinia" "Busch"

Or using a data.table approach:

library(data.table)

setDT(DATA)[, study_id[.N > 1], .(study_id, var1, var2)]$V1

#> [1] "Alavinia" "Busch"