Home > other >  how to select categories that are in one dataset, but not in another in R
how to select categories that are in one dataset, but not in another in R

Time:12-31

There are 2 datasets

s=structure(list(var1 = c("a", "f", "k", "tt", "ee"), var2 = c("b", 
"g", "l", "qq", "rr"), var3 = c("c", "h", "m", "ff", "cc"), var4 = c("d", 
"i", "n", "gg", "vv"), var5 = c("e", "j", "o", "aa", "xx"), metric_var = c(100L, 
200L, 300L, 567L, 789L)), class = "data.frame", row.names = c(NA, 
-5L))

and

medagger=structure(list(var1 = c("a", "z", "w", "f", "k"), var2 = c("b", 
"u", "e", "g", "l"), var3 = c("c", "p", "r", "h", "m"), var4 = c("d", 
"q", "q", "i", "n"), var5 = c("e", "n", "w", "j", "o"), metric_var = c(100L, 
200L, 400L, 500L, 700L)), class = "data.frame", row.names = c(NA, 
-5L))

var1-var5 are categorical variables, their values are categories. if we do an inner join s and a medagger we get this result

merge(s,medagger,by=c("var1","var2","var3","var4","var5"))
  var1 var2 var3 var4 var5 metric_var.x metric_var.y
1    a    b    c    d    e          100          100
2    f    g    h    i    j          200          500
3    k    l    m    n    o          300          700

only 3 rows to return, because in these datasets there are the same categories. However, I need the categories that there are in the s dataset but not in the medagger dataset were put in separate dataframes. In this case, I need the dataframe new to appear, which will contain only 2 of these rows from s .

var1    var2    var3    var4    var5    metric_var
tt  qq  ff  gg  aa  567
ee  rr  cc  vv  xx  789

How to do that the categories that were in s but not in the medagger appear in the new dataframe ? Thank you.

CodePudding user response:

You need anti_join from dplyr library to get the expected result. According to the documentation:

anti_join() return all rows from x without a match in y.

documentation

library(dplyr)

df <- s %>%
  anti_join(medagger, by = c("var1","var2","var3","var4","var5"))

The output will be like this:

  var1 var2 var3 var4 var5 metric_var
1   tt   qq   ff   gg   aa        567
2   ee   rr   cc   vv   xx        789

CodePudding user response:

library(dplyr)
s %>% inner_join(medagger, by=c("var1","var2","var3","var4","var5")) -> result1

  var1 var2 var3 var4 var5 metric_var.x metric_var.y
1    a    b    c    d    e          100          100
2    f    g    h    i    j          200          500
3    k    l    m    n    o          300          700

and the second dataframe:

s %>% anti_join(medagger, by=c("var1","var2","var3","var4","var5")) -> result2
  var1 var2 var3 var4 var5 metric_var
1   tt   qq   ff   gg   aa        567
2   ee   rr   cc   vv   xx        789

In base R, this works assuming that medagger and s do not have NA's:

merge(s,medagger,by=c("var1","var2","var3","var4","var5"), all.x=T) -> result
result2  <- result[!complete.cases(result),]

CodePudding user response:

You could rbind each row and look if it's not duplicated.

s[apply(s, 1, \(x) tail(!duplicated(rbind(medagger[1:5], x[1:5])), 1)), ]
#   var1 var2 var3 var4 var5 metric_var
# 4   tt   qq   ff   gg   aa        567
# 5   ee   rr   cc   vv   xx        789
  • Related