Trying to find all duplicates, but by group in R

Time:12-08

I am trying to find the duplicates, but based on a grouping. The grouping variable I want to use is called MRN (i.e. BMIdf$MRN). In other words, I want to find the duplicates, but only if it is a duplicate for the specific MRN id. I am not sure how to incorporate that grouping into my syntax. Here is what I have so far.

BMIdf$dupobs<-ifelse(((duplicated(BMIdf$OBSERVATION_DATE))| 
(duplicated(BMIdf$OBSERVATION_DATE,fromLast = TRUE))),TRUE,FALSE)

How can I return TRUE only if it is a duplicate for a given MRN id? I am open to non-data.table methods.

Here is some sample data:

library(anytime)  # provides anydate()
sample <- data.frame(MRN = c(1, 2, 1, 2, 3, 4, 3),
                     OBSERVATION_DATE = anydate(c("2013-02-19", "2013-02-28", "2013-02-19", "2013-02-28", "2013-02-28", "2013-03-08", "2014-01-06")))

So I want it to recognize the 2nd and 4th dates in the vector as duplicates, but not the 5th, since the 5th has a different MRN id.

CodePudding user response:

data.table

library(data.table)
as.data.table(sample)[, dupobs := any(duplicated(.SD)), by = MRN][]
#      MRN OBSERVATION_DATE dupobs
#    <num>           <Date> <lgcl>
# 1:     1       2013-02-19   TRUE
# 2:     2       2013-02-28   TRUE
# 3:     1       2013-02-19   TRUE
# 4:     2       2013-02-28   TRUE
# 5:     3       2013-02-28  FALSE
# 6:     4       2013-03-08  FALSE
# 7:     3       2014-01-06  FALSE
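
One thing to keep in mind: any(duplicated(...)) marks every row of an MRN as soon as one date repeats within that MRN. If only the repeated rows themselves should be flagged, a minimal sketch could combine duplicated() from both directions inside each group. The demo table, its extra 2013-05-01 row, and the column names any_dup and row_dup are made up for illustration; they are not part of the original data.

```r
library(data.table)

# Hypothetical data: MRN 1 has one repeated date and one unique date.
demo <- data.table(
  MRN = c(1, 1, 1, 2),
  OBSERVATION_DATE = as.Date(c("2013-02-19", "2013-02-19",
                               "2013-05-01", "2013-02-19"))
)

# Group-level flag: TRUE for every row of an MRN that has any repeated date.
demo[, any_dup := any(duplicated(OBSERVATION_DATE)), by = MRN]

# Row-level flag: TRUE only for rows whose date occurs more than once
# within the same MRN.
demo[, row_dup := duplicated(OBSERVATION_DATE) |
                  duplicated(OBSERVATION_DATE, fromLast = TRUE), by = MRN]

demo[]
```

Here the third row (MRN 1, 2013-05-01) gets any_dup = TRUE but row_dup = FALSE, which is the difference between the two definitions. On the sample data in the question both give the same result.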

dplyr

library(dplyr)
sample %>%
  group_by(MRN) %>%
  mutate(dupobs = any(duplicated(OBSERVATION_DATE))) %>%
  ungroup()
# # A tibble: 7 x 3
#     MRN OBSERVATION_DATE dupobs
#   <dbl> <date>           <lgl> 
# 1     1 2013-02-19       TRUE  
# 2     2 2013-02-28       TRUE  
# 3     1 2013-02-19       TRUE  
# 4     2 2013-02-28       TRUE  
# 5     3 2013-02-28       FALSE 
# 6     4 2013-03-08       FALSE 
# 7     3 2014-01-06       FALSE 

base R

sample$dupobs <- ave(as.integer(sample$OBSERVATION_DATE), sample$MRN,
                     FUN = function(z) any(duplicated(z))) > 0
sample
#   MRN OBSERVATION_DATE dupobs
# 1   1       2013-02-19   TRUE
# 2   2       2013-02-28   TRUE
# 3   1       2013-02-19   TRUE
# 4   2       2013-02-28   TRUE
# 5   3       2013-02-28  FALSE
# 6   4       2013-03-08  FALSE
# 7   3       2014-01-06  FALSE

With ave, the output takes the class of the first argument, which can be rather inconvenient. That is why I cast the dates to integer (so that the function works without error): the inner function returns a logical, but ave coerces it to the integer type of the original vector, which turns FALSE into 0 and TRUE into 1. Comparing that output (0s and 1s) against 0 then recovers the logical flag. A minor inconvenience.
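
The coercion described above can be seen on a tiny vector. This is only a sketch; x, g, and res are made-up names and values, not part of the original data.

```r
x <- c(10L, 20L, 10L)   # made-up integer values
g <- c("a", "b", "a")   # made-up grouping

# The inner function returns a single logical per group, but ave() writes
# the result back into a copy of x, so it is stored as integer.
res <- ave(x, g, FUN = function(z) any(duplicated(z)))

res              # 0/1 values rather than FALSE/TRUE
is.integer(res)  # the type of x is kept
res > 0          # comparing against 0 recovers the logical flag
```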


Data

sample <- structure(list(MRN = c(1, 2, 1, 2, 3, 4, 3), OBSERVATION_DATE = structure(c(15755, 15764, 15755, 15764, 15764, 15772, 16076), class = "Date")), class = "data.frame", row.names = c(NA, -7L))

CodePudding user response:

It will probably be more efficient to count by group:

sampleDT[, n := .N, by=.(MRN, OBSERVATION_DATE)]
sampleDT[, flag := n > 1L]  # TRUE when the (MRN, date) pair occurs more than once

     MRN OBSERVATION_DATE     n   flag
   <num>           <IDat> <int> <lgcl>
1:     1       2013-02-19     2   TRUE
2:     2       2013-02-28     2   TRUE
3:     1       2013-02-19     2   TRUE
4:     2       2013-02-28     2   TRUE
5:     3       2013-02-28     1  FALSE
6:     4       2013-03-08     1  FALSE
7:     3       2014-01-06     1  FALSE

Calculation of .N by group is optimized (see ?GForce), though I think it might not be enabled for := yet.
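
For anyone not using data.table, the same counting idea can be sketched in base R: count the rows per (MRN, OBSERVATION_DATE) pair and treat a pair that occurs more than once as a duplicate. The names obs, key, key_cols, and dupobs below are my own choices for this illustration, and no claim is made about how its speed compares with the data.table version.

```r
# Same values as the sample data, rebuilt here so the block runs on its own.
obs <- data.frame(
  MRN = c(1, 2, 1, 2, 3, 4, 3),
  OBSERVATION_DATE = as.Date(c("2013-02-19", "2013-02-28", "2013-02-19",
                               "2013-02-28", "2013-02-28", "2013-03-08",
                               "2014-01-06"))
)

# One string key per (MRN, date) pair, then the number of rows sharing it.
key <- paste(obs$MRN, obs$OBSERVATION_DATE)
obs$n <- ave(seq_along(key), key, FUN = length)

# A pair seen more than once is a duplicate within its MRN.
obs$dupobs <- obs$n > 1L

# Cross-check against duplicated() applied to both columns at once.
key_cols <- obs[c("MRN", "OBSERVATION_DATE")]
stopifnot(identical(
  obs$dupobs,
  duplicated(key_cols) | duplicated(key_cols, fromLast = TRUE)
))

obs
```

Rows 1 to 4 come out with n = 2 and dupobs = TRUE, and rows 5 to 7 with n = 1 and dupobs = FALSE.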

Input:

sampleDT <- data.table(
  MRN = c(1, 2, 1, 2, 3, 4, 3),
  OBSERVATION_DATE = as.IDate(c("2013-02-19", "2013-02-28", "2013-02-19", 
    "2013-02-28", "2013-02-28", "2013-03-08", "2014-01-06"))
)