Let's say I have the below df
df <- data.table(id = c(1, 2, 2, 3)
, datee = as.Date(c('2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'))
); df
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 2 2022-01-02
4: 3 2022-01-03
and I wanted to keep only the non-duplicated rows
df[!duplicated(id, datee)]
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 3 2022-01-03
which worked.
However, with the below df_1
df_1 <- data.table(a = c(1,1,2)
, b = c(1,1,3)
); df_1
a b
1: 1 1
2: 1 1
3: 2 3
using the same method does not get rid of the duplicated rows
df_1[!duplicated(a, b)]
a b
1: 1 1
2: 1 1
3: 2 3
What am I doing wrong?
CodePudding user response:
Let's dive into why your df_1[!duplicated(a, b)] doesn't work.
duplicated uses S3 method dispatch.
library(data.table)
.S3methods("duplicated")
# [1] duplicated.array duplicated.data.frame
# [3] duplicated.data.table* duplicated.default
# [5] duplicated.matrix duplicated.numeric_version
# [7] duplicated.POSIXlt duplicated.warnings
# see '?methods' for accessing help and source code
Looking at those, we aren't using duplicated.data.table, since we're calling it with individual vectors: dispatch happens on the class of the first argument, a, which is a plain numeric vector (the function has no idea it is being called from within a data.table context). So it makes sense to look into duplicated.default.
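As a quick check of the dispatch claim, here is a minimal sketch that inspects the classes involved. It rebuilds df_1 from the question; nothing else is assumed.

```r
library(data.table)

df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))

# The whole table carries the data.table class, so duplicated(df_1)
# would dispatch to duplicated.data.table.
class(df_1)    # "data.table" "data.frame"

# A single column is a plain numeric vector, so duplicated(a, b)
# falls through to duplicated.default.
class(df_1$a)  # "numeric"
```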
> debugonce(duplicated.default)
> df_1[!duplicated(a, b)]
debugging in: duplicated.default(a, b)
debug: .Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x),
nlevels(x) + 1L) else nmax))
Browse[2]> match.call() # ~ "how this function was called"
duplicated.default(x = a, incomparables = b)
Confirming with ?duplicated:
x: a vector or a data frame or an array or 'NULL'.
incomparables: a vector of values that cannot be compared. 'FALSE' is
a special value, meaning that all values can be compared, and
may be the only value accepted for methods other than the
default. It will be coerced internally to the same type as
'x'.
From this we can see that a is being used for deduplication, and b is matched positionally to incomparables. Because b contains the value 1, which is also the duplicated value in a, the rows where a == 1 are treated as incomparable and are never flagged as duplicates.
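The same effect can be reproduced on bare vectors, outside of any data.table. This is a minimal sketch using the values from the question. The last part is my reading of why the first example in the question appeared to work: the dates are coerced to their underlying day counts (around 19000), which never match any id, so nothing is excluded.

```r
a <- c(1, 1, 2)
b <- c(1, 1, 3)

# One argument: the second 1 is flagged as a duplicate.
plain <- duplicated(a)
print(plain)   # FALSE  TRUE FALSE

# Two arguments: b lands in `incomparables`, so the value 1 is never flagged.
with_b <- duplicated(a, b)
print(with_b)  # FALSE FALSE FALSE

# The question's first example: the dates act as incomparables too,
# but none of them equals an id, so deduplication on id alone still happens.
id <- c(1, 2, 2, 3)
datee <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-02", "2022-01-03"))
with_dates <- duplicated(id, datee)
print(with_dates)  # FALSE FALSE  TRUE FALSE
```

In other words, df[!duplicated(id, datee)] only looked right because it deduplicated on id alone and the data happened to agree.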
To confirm, if we change b such that it does not share (duplicated) values with a, we see that the deduplication of a works as intended (though it is silently ignoring b's dupes due to the argument problem):
df_1 <- data.table(a = c(1,1,2) , b = c(2,2,4))
df_1[!duplicated(a, b)] # accidentally correct, `b` is not used
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_1, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
df_2 <- data.table(a = c(1,1,2) , b = c(2,3,4))
df_2[!duplicated(a, b)] # wrong, `b` is not considered
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_2, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 1 3
# 3: 2 4
(Note that unique above is actually data.table:::unique.data.table, another S3 method provided by the data.table package.)
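If you prefer to keep the !duplicated(...) form, one option is to pass the table itself so that the data.table method is dispatched, and name the columns through its by argument. A minimal sketch, rebuilding the original df_1 and the df_2 used above:

```r
library(data.table)

df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))
df_2 <- data.table(a = c(1, 1, 2), b = c(2, 3, 4))

# Passing the data.table dispatches to duplicated.data.table,
# which compares whole rows over the columns listed in `by`.
dedup_1 <- df_1[!duplicated(df_1, by = c("a", "b"))]
dedup_2 <- df_2[!duplicated(df_2, by = c("a", "b"))]

print(dedup_1)  # 2 rows: (1, 1) and (2, 3)
print(dedup_2)  # 3 rows: no (a, b) pair repeats
```

This should give the same rows as unique(df_1, by = c("a", "b")), which is usually the shorter way to write it.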
debug and debugonce are your friends :-)