Consider these two data.tables, foo
and bar
.
foo <- data.table(id = c(1,2,3,4), f1 = c("a", "b", "c", "d"), f2 = c("a", "b", "c", "d"))
bar <- data.table(id = c(1,2,3,4), f1 = c("a", "a", "c", "d"), f2 = c("a", "b", "c", "e"))
foo
id f1 f2
1: 1 a a
2: 2 b b
3: 3 c c
4: 4 d d
bar
id f1 f2
1: 1 a a
2: 2 a b
3: 3 c c
4: 4 d e
I know that foo
and bar
have a 1-1 relationship.
I would like to select rows from bar
such that the corresponding row in foo
has different values. For example,
- id 1: the values of
f1
andf2
are the same infoo
and bar, so exclude this one - id 2: the value of
f1
has changed! include this in the result - id 3: the values of
f1
andf2
are the same infoo
and bar, so exclude this one - id 4: the value of
f2
has changed! include this in the result
Expected Result
bar[c(2,4)]
id f1 f2
1: 2 a b
2: 4 d e
What I tried
I thought a non-equi join would work great here.. Unfortunately, it seems the "not equals" operator isn't supported.?
foo[!bar, on = c("id=id", "f1!=f1", "f2!=f2")]
# Invalid operators !=,!=. Only allowed operators are ==<=<>=>.
foo[!bar, on = c("id=id", "f1<>f1", "f2<>f2")]
# Found more than one operator in one 'on' statement: f1<>f1. Please specify a single operator.
CodePudding user response:
As other than data.table
solution are also welcomed. Here is a tidyverse
solution:
library(dplyr)
library(tidyr)
left_join(foo, bar, by="id") %>%
group_by(id) %>%
mutate(identical = n_distinct(unlist(cur_data())) == 1) %>%
filter(identical == FALSE) %>%
select(id, f1=f1.y, f2=f2.y)
Groups: id [2]
id f1 f2
<dbl> <chr> <chr>
1 2 a b
2 4 d e
CodePudding user response:
With data.table
:
bar[foo,.SD[i.f1!=x.f1|i.f2!=x.f2],on="id"]
id f1 f2
<num> <char> <char>
1: 2 a b
2: 4 d e
CodePudding user response:
Benchmarking on a larger dataset with a few different data.table
options. The mapply
option works only if all.equal(foo$id, bar$id)
.
library(data.table)
foo <- data.table(id = 1:1e5, f1 = 1:1e5, f2 = 1:1e5)
mix <- sample(1e5, 5e4)
bar <- foo[mix[1:25e3], f1 := 0L][mix[25001:5e4], f2 := 0L]
microbenchmark::microbenchmark(join = bar[foo,.SD[i.f1!=x.f1|i.f2!=x.f2],on="id"],
fsetdiff = fsetdiff(bar, foo),
duplicated = bar[(!duplicated(rbindlist(list(bar, foo)), fromLast = TRUE))[1:nrow(bar)]],
mapply = bar[rowSums(mapply(function(i) foo[[i]] != bar[[i]], 2:length(bar))) > 0,],
check = "equal")
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> join 12.3091 13.079300 14.456793 13.645051 14.361851 24.5162 100
#> fsetdiff 16.4108 17.727301 19.693330 18.343152 20.119901 32.4459 100
#> duplicated 17.0628 18.669351 21.215564 19.841400 21.820051 62.8362 100
#> mapply 3.1556 3.548052 4.675822 3.741052 4.141851 12.3819 100
CodePudding user response:
I think this is best (cleanest, but perhaps not fastest?):
bar[!foo, on=.(id,f1,f2)]
id f1 f2
<num> <char> <char>
1: 2 a b
2: 4 d e