Home > Mobile >  Using distinct() by group and conditional on a value from another column in R
Using distinct() by group and conditional on a value from another column in R

Time:10-22

I have data that has duplicate rows and I would like to use distinct() in dplyr to get rid of these duplicates. However, I only want to run distinct() for a certain condition that is identified in another column. Here is an example dataset:

id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,0,1,
    0,0,0,0,1,1,
    0,0,0,0,0)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
   
    "2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
    
    "2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-11 00:00:00")
dat <- cbind(id,ind,dt)
dat <- data.frame(dat)
dat$id <- factor(dat$id)
dat$ind <- factor(dat$ind)
dat$dt <- as.POSIXct(dat$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat

I would like to be able to use distinct() within each id and only when ind==0. When ind==1, I don't want distinct() to run, as I want to keep any duplicate rows when ind==1. The resulting dataset would look like this:

id <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,1,
    0,0,0,1,1,
    0,0,0,0,0)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
   
    "2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
    
    "2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-11 00:00:00")
dat2 <- cbind(id,ind,dt)
dat2 <- data.frame(dat2)
dat2$id <- factor(dat2$id)
dat2$ind <- factor(dat2$ind)
dat2$dt <- as.POSIXct(dat2$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat2

I have tried the following code, but it returns an error:

dat2 <- dat %>% 
    group_by(id) %>%
    distinct(dt[ind==0], .keep_all=TRUE) %>%
    ungroup()

Thanks for the help!

EDIT @akrun original answer works great for the above, but my example was more simple than my actual dataset. In addition to the above, I also have the scenario where within an id, there are duplicated rows that share the same dt but one row will have ind==0 and another ind==1. I need to keep only the row with ind==1, for this scenario. So in summary, if rows are duplicated by dt and ind==1 for all those rows, then I keep all of them. If rows are duplicated by dt and ind==0 for all those rows, then I keep only one of them. If rows are duplicated by dt and ind==0 for one and ind==1 for the other row, then I keep only the ind==1 row. Example dataset:

id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,0,1,
    0,0,0,0,1,1,
    0,0,0,0,1)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
   
    "2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
    
    "2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-10 00:00:00")
dat <- cbind(id,ind,dt)
dat <- data.frame(dat)
dat$id <- factor(dat$id)
dat$ind <- factor(dat$ind)
dat$dt <- as.POSIXct(dat$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat

What I want to get:

id <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
ind <- c(0,0,0,0,1,
    0,0,0,1,1,
    0,0,0,1)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
   
    "2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
    
    "2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00")
dat2 <- cbind(id,ind,dt)
dat2 <- data.frame(dat2)
dat2$id <- factor(dat2$id)
dat2$ind <- factor(dat2$ind)
dat2$dt <- as.POSIXct(dat2$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat2

Thanks!

CodePudding user response:

We may use filter

library(dplyr)
dat %>%
    group_by(id, ind) %>% 
    filter(if(all(ind == 0)) !duplicated(dt) else TRUE) %>% 
    ungroup

-output

# A tibble: 15 × 3
   id    ind   dt                 
   <fct> <fct> <dttm>             
 1 1     0     2017-06-30 00:00:00
 2 1     0     2017-07-01 00:00:00
 3 1     0     2017-07-04 00:00:00
 4 1     0     2017-07-05 00:00:00
 5 1     1     2017-07-06 00:00:00
 6 2     0     2017-08-04 00:00:00
 7 2     0     2017-08-07 00:00:00
 8 2     0     2017-08-09 00:00:00
 9 2     1     2017-08-10 00:00:00
10 2     1     2017-08-10 00:00:00
11 3     0     2017-09-04 00:00:00
12 3     0     2017-09-07 00:00:00
13 3     0     2017-09-09 00:00:00
14 3     0     2017-09-10 00:00:00
15 3     0     2017-09-11 00:00:00

Or we may use arrange as well

dat %>% 
   arrange(id, ind == 0, dt) %>% 
   group_by(id) %>%
   filter(!(duplicated(dt) & ind == 0)) %>%
   ungroup
  • Related