I am trying to find the duplicates, but based on a grouping. The grouping variable I want to use is called MRN (i.e. BMIdf$MRN). In other words, I want to find the duplicates, but only if it is a duplicate for the specific MRN id. I am not sure how to incorporate that grouping into my syntax. Here is what I have so far.
BMIdf$dupobs<-ifelse(((duplicated(BMIdf$OBSERVATION_DATE))|
(duplicated(BMIdf$OBSERVATION_DATE,fromLast = TRUE))),TRUE,FALSE)
How can I return TRUE only if it is a duplicate for a given MRN id? I am open to non-data.table methods.
Here is some sample data:
library(anytime)  # provides anydate()
sample <- data.frame(MRN = c(1, 2, 1, 2, 3, 4, 3),
                     OBSERVATION_DATE = anydate(c("2013-02-19", "2013-02-28", "2013-02-19", "2013-02-28", "2013-02-28", "2013-03-08", "2014-01-06")))
So I want it to recognize the 2nd and 4th dates in the vector as duplicates, but not the 5th, because the 5th has a different MRN id.
CodePudding user response:
data.table
library(data.table)
as.data.table(sample)[, dupobs := any(duplicated(.SD)), by = MRN][]
#      MRN OBSERVATION_DATE dupobs
#    <num>           <Date> <lgcl>
# 1:     1       2013-02-19   TRUE
# 2:     2       2013-02-28   TRUE
# 3:     1       2013-02-19   TRUE
# 4:     2       2013-02-28   TRUE
# 5:     3       2013-02-28  FALSE
# 6:     4       2013-03-08  FALSE
# 7:     3       2014-01-06  FALSE
dplyr
library(dplyr)
sample %>%
  group_by(MRN) %>%
  mutate(dupobs = any(duplicated(OBSERVATION_DATE))) %>%
  ungroup()
# # A tibble: 7 x 3
#     MRN OBSERVATION_DATE dupobs
#   <dbl> <date>           <lgl>
# 1     1 2013-02-19       TRUE
# 2     2 2013-02-28       TRUE
# 3     1 2013-02-19       TRUE
# 4     2 2013-02-28       TRUE
# 5     3 2013-02-28       FALSE
# 6     4 2013-03-08       FALSE
# 7     3 2014-01-06       FALSE
base R
sample$dupobs <- ave(as.integer(sample$OBSERVATION_DATE), sample$MRN,
FUN = function(z) any(duplicated(z))) > 0
sample
#   MRN OBSERVATION_DATE dupobs
# 1   1       2013-02-19   TRUE
# 2   2       2013-02-28   TRUE
# 3   1       2013-02-19   TRUE
# 4   2       2013-02-28   TRUE
# 5   3       2013-02-28  FALSE
# 6   4       2013-03-08  FALSE
# 7   3       2014-01-06  FALSE
With ave, the first argument's class is used for the output, which can be rather inconvenient. For this, I cast to integer (so that the function will work without error); the inner function will initially create a logical, but ave converts it to the integer (of the original vector), which translates false to 0 and true to 1. From there, I compare the output (0s and 1s) against 0 to see if it was true. Minor inconvenience.
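To make the coercion described above concrete, here is a minimal sketch on toy inputs. The names grp, val and res and their values are made up for illustration and are not part of the question's data.

```r
# Toy inputs (hypothetical values, not the question's data)
grp <- c("a", "a", "b", "b")
val <- as.integer(c(5, 5, 7, 8))

# The inner function returns a single logical per group ...
res <- ave(val, grp, FUN = function(z) any(duplicated(z)))

# ... but ave() writes it back into the integer vector, so the result is 0/1
class(res)   # "integer"
res          # 1 1 0 0

# Comparing against 0 recovers a logical vector
res > 0      # TRUE TRUE FALSE FALSE
```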
Data
sample <- structure(list(MRN = c(1, 2, 1, 2, 3, 4, 3), OBSERVATION_DATE = structure(c(15755, 15764, 15755, 15764, 15764, 15772, 16076), class = "Date")), class = "data.frame", row.names = c(NA, -7L))
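One caveat worth checking against your real data: any(duplicated(...)) within each MRN marks every row of an MRN as soon as any of its dates repeats, including rows whose own date is unique. On the sample data this makes no difference, but if a row-level flag is what you need, a sketch along these lines may help. It is an alternative technique, not what the code above does, and df, key, group_dup and row_dup are hypothetical names on made-up data.

```r
# Made-up data: MRN 1 has one repeated date and one unique date
df <- data.frame(
  MRN = c(1, 1, 1, 2),
  OBSERVATION_DATE = as.Date(c("2013-02-19", "2013-02-19",
                               "2013-03-01", "2013-02-19"))
)

# Group-level flag, as in the ave() approach: TRUE for all rows of MRN 1
group_dup <- ave(as.integer(df$OBSERVATION_DATE), df$MRN,
                 FUN = function(z) any(duplicated(z))) > 0

# Row-level flag: TRUE only where the (MRN, date) pair itself occurs twice
key <- df[c("MRN", "OBSERVATION_DATE")]
row_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

group_dup  # TRUE TRUE TRUE FALSE
row_dup    # TRUE TRUE FALSE FALSE
```

Which of the two is right depends on whether dupobs is meant to describe the MRN as a whole or each individual observation.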
CodePudding user response:
It will probably be more efficient to count by group:
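# n = number of rows sharing the same (MRN, OBSERVATION_DATE) pair
sampleDT[, n := .N, by=.(MRN, OBSERVATION_DATE)]
# flag is TRUE when the pair occurs only once, i.e. the row is NOT a duplicate
sampleDT[, flag := n == 1L]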
     MRN OBSERVATION_DATE     n   flag
   <num>           <IDat> <int> <lgcl>
1:     1       2013-02-19     2  FALSE
2:     2       2013-02-28     2  FALSE
3:     1       2013-02-19     2  FALSE
4:     2       2013-02-28     2  FALSE
5:     3       2013-02-28     1   TRUE
6:     4       2013-03-08     1   TRUE
7:     3       2014-01-06     1   TRUE
Calculation of .N by group is optimized (see ?GForce), though I think it might not be enabled for := yet.
Input:
library(data.table)
sampleDT <- data.table(
  MRN = c(1, 2, 1, 2, 3, 4, 3),
  OBSERVATION_DATE = as.IDate(c("2013-02-19", "2013-02-28", "2013-02-19",
                                "2013-02-28", "2013-02-28", "2013-03-08", "2014-01-06"))
)
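Since the question asks for TRUE on the duplicated rows, the count-by-group idea can be turned around by testing for a count above one. This is a minimal, self-contained sketch on the question's sample values. The object name dt is hypothetical, and the column name dupobs is borrowed from the question.

```r
library(data.table)

# Same values as the question's sample data
dt <- data.table(
  MRN = c(1, 2, 1, 2, 3, 4, 3),
  OBSERVATION_DATE = as.IDate(c("2013-02-19", "2013-02-28", "2013-02-19",
                                "2013-02-28", "2013-02-28", "2013-03-08",
                                "2014-01-06"))
)

# Count rows per (MRN, OBSERVATION_DATE) pair
dt[, n := .N, by = .(MRN, OBSERVATION_DATE)]

# A pair that occurs more than once is a duplicate within that MRN
dt[, dupobs := n > 1L]

dt$dupobs  # TRUE TRUE TRUE TRUE FALSE FALSE FALSE
```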