How can I select rows from table A whose id matches those from table B, but whose (non id) values ar-CodePudding

Consider these two data.tables, foo and bar.

foo <- data.table(id = c(1,2,3,4), f1 = c("a", "b", "c", "d"), f2 = c("a", "b", "c", "d"))
bar <- data.table(id = c(1,2,3,4), f1 = c("a", "a", "c", "d"), f2 = c("a", "b", "c", "e"))

foo
   id f1 f2
1:  1  a  a
2:  2  b  b
3:  3  c  c
4:  4  d  d

bar
   id f1 f2
1:  1  a  a
2:  2  a  b
3:  3  c  c
4:  4  d  e

I know that foo and bar have a 1-1 relationship.

I would like to select rows from bar such that the corresponding row in foo has different values. For example,

id 1: the values of f1 and f2 are the same in foo and bar, so exclude this one
id 2: the value of f1 has changed! include this in the result
id 3: the values of f1 and f2 are the same in foo and bar, so exclude this one
id 4: the value of f2 has changed! include this in the result

Expected Result

bar[c(2,4)]
   id f1 f2
1:  2  a  b
2:  4  d  e

What I tried

I thought a non-equi join would work great here.. Unfortunately, it seems the "not equals" operator isn't supported.?

foo[!bar, on = c("id=id", "f1!=f1", "f2!=f2")]
# Invalid operators !=,!=. Only allowed operators are ==<=<>=>.

foo[!bar, on = c("id=id", "f1<>f1", "f2<>f2")]
# Found more than one operator in one 'on' statement: f1<>f1. Please specify a single operator.

CodePudding user response：

As other than data.table solution are also welcomed. Here is a tidyverse solution:

library(dplyr)
library(tidyr)

left_join(foo, bar, by="id") %>%
  group_by(id) %>% 
  mutate(identical = n_distinct(unlist(cur_data())) == 1) %>% 
  filter(identical == FALSE) %>% 
  select(id, f1=f1.y, f2=f2.y)

 Groups:   id [2]
     id f1    f2   
  <dbl> <chr> <chr>
1     2 a     b    
2     4 d     e

CodePudding user response：

With data.table:

bar[foo,.SD[i.f1!=x.f1|i.f2!=x.f2],on="id"]

      id     f1     f2
   <num> <char> <char>
1:     2      a      b
2:     4      d      e

CodePudding user response：

Benchmarking on a larger dataset with a few different data.table options. The mapply option works only if all.equal(foo$id, bar$id).

library(data.table)

foo <- data.table(id = 1:1e5, f1 = 1:1e5, f2 = 1:1e5)
mix <- sample(1e5, 5e4)
bar <- foo[mix[1:25e3], f1 := 0L][mix[25001:5e4], f2 := 0L]

microbenchmark::microbenchmark(join = bar[foo,.SD[i.f1!=x.f1|i.f2!=x.f2],on="id"],
                               fsetdiff = fsetdiff(bar, foo),
                               duplicated = bar[(!duplicated(rbindlist(list(bar, foo)), fromLast = TRUE))[1:nrow(bar)]],
                               mapply = bar[rowSums(mapply(function(i) foo[[i]] != bar[[i]], 2:length(bar))) > 0,],
                               check = "equal")
#> Unit: milliseconds
#>        expr     min        lq      mean    median        uq     max neval
#>        join 12.3091 13.079300 14.456793 13.645051 14.361851 24.5162   100
#>    fsetdiff 16.4108 17.727301 19.693330 18.343152 20.119901 32.4459   100
#>  duplicated 17.0628 18.669351 21.215564 19.841400 21.820051 62.8362   100
#>      mapply  3.1556  3.548052  4.675822  3.741052  4.141851 12.3819   100

CodePudding user response：

I think this is best (cleanest, but perhaps not fastest?):

bar[!foo, on=.(id,f1,f2)]

      id     f1     f2
   <num> <char> <char>
1:     2      a      b
2:     4      d      e