I have data that has duplicate rows and I would like to use distinct() in dplyr to get rid of these duplicates. However, I only want to run distinct() for a certain condition that is identified in another column. Here is an example dataset:
id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,0,1,
0,0,0,0,1,1,
0,0,0,0,0)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
"2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-11 00:00:00")
dat <- cbind(id,ind,dt)
dat <- data.frame(dat)
dat$id <- factor(dat$id)
dat$ind <- factor(dat$ind)
dat$dt <- as.POSIXct(dat$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat
I would like to be able to use distinct() within each id and only when ind==0. When ind==1, I don't want distinct() to run, as I want to keep any duplicate rows when ind==1. The resulting dataset would look like this:
id <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,1,
0,0,0,1,1,
0,0,0,0,0)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
"2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-11 00:00:00")
dat2 <- cbind(id,ind,dt)
dat2 <- data.frame(dat2)
dat2$id <- factor(dat2$id)
dat2$ind <- factor(dat2$ind)
dat2$dt <- as.POSIXct(dat2$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat2
I have tried the following code, but it returns an error:
dat2 <- dat %>%
group_by(id) %>%
distinct(dt[ind==0], .keep_all=TRUE) %>%
ungroup()
Thanks for the help!
EDIT @akrun original answer works great for the above, but my example was more simple than my actual dataset. In addition to the above, I also have the scenario where within an id, there are duplicated rows that share the same dt but one row will have ind==0 and another ind==1. I need to keep only the row with ind==1, for this scenario. So in summary, if rows are duplicated by dt and ind==1 for all those rows, then I keep all of them. If rows are duplicated by dt and ind==0 for all those rows, then I keep only one of them. If rows are duplicated by dt and ind==0 for one and ind==1 for the other row, then I keep only the ind==1 row. Example dataset:
id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3)
ind <- c(0,0,0,0,0,1,
0,0,0,0,1,1,
0,0,0,0,1)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
"2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00", "2017-09-10 00:00:00")
dat <- cbind(id,ind,dt)
dat <- data.frame(dat)
dat$id <- factor(dat$id)
dat$ind <- factor(dat$ind)
dat$dt <- as.POSIXct(dat$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat
What I want to get:
id <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
ind <- c(0,0,0,0,1,
0,0,0,1,1,
0,0,0,1)
dt <- c("2017-06-30 00:00:00", "2017-07-01 00:00:00", "2017-07-04 00:00:00", "2017-07-05 00:00:00", "2017-07-06 00:00:00",
"2017-08-04 00:00:00", "2017-08-07 00:00:00", "2017-08-09 00:00:00", "2017-08-10 00:00:00", "2017-08-10 00:00:00",
"2017-09-04 00:00:00", "2017-09-07 00:00:00", "2017-09-09 00:00:00", "2017-09-10 00:00:00")
dat2 <- cbind(id,ind,dt)
dat2 <- data.frame(dat2)
dat2$id <- factor(dat2$id)
dat2$ind <- factor(dat2$ind)
dat2$dt <- as.POSIXct(dat2$dt,tz="America/Chicago",format="%Y-%m-%d %H:%M:%OS")
dat2
Thanks!
CodePudding user response:
We may use filter
library(dplyr)
dat %>%
group_by(id, ind) %>%
filter(if(all(ind == 0)) !duplicated(dt) else TRUE) %>%
ungroup
-output
# A tibble: 15 × 3
id ind dt
<fct> <fct> <dttm>
1 1 0 2017-06-30 00:00:00
2 1 0 2017-07-01 00:00:00
3 1 0 2017-07-04 00:00:00
4 1 0 2017-07-05 00:00:00
5 1 1 2017-07-06 00:00:00
6 2 0 2017-08-04 00:00:00
7 2 0 2017-08-07 00:00:00
8 2 0 2017-08-09 00:00:00
9 2 1 2017-08-10 00:00:00
10 2 1 2017-08-10 00:00:00
11 3 0 2017-09-04 00:00:00
12 3 0 2017-09-07 00:00:00
13 3 0 2017-09-09 00:00:00
14 3 0 2017-09-10 00:00:00
15 3 0 2017-09-11 00:00:00
Or we may use arrange
as well
dat %>%
arrange(id, ind == 0, dt) %>%
group_by(id) %>%
filter(!(duplicated(dt) & ind == 0)) %>%
ungroup