I am dealing with a dataset with duplicate rows like this
Id Date x1 a1 Col1 Col2 Col3
1 2004-11-29 1 2 0 NA 1
1 2004-11-29 1 2 1 0 0
2 2005-04-26 2 2 NA 1 0
3 2006-10-09 1 2 1 0 1
3 2006-10-09 1 2 0 0 NA
What I like to do is, among the two rows for the same ID , same date, if the value in Col1, Col2, Col3 is 1 then replace the value with 1, else its a 0 or NA if both values are missing in that column.
For example, ID 1 two rows with same date,
- Col1 0, 1 - replace the value to 1,
- Col2 NA,0 - replace the value to 0
- Col3 1, 0 - replace the value to 1
Two rows with
- 0,1 gets replaced with 1,
- NA, NA is replaced with NA
- NA, 0 is replaced with 0
- NA, 1 is replaced with 1
- 1, 1 is replaced with 1
- 0, 0 is replaced with 0
So on
Expecting a dataset like this
Id Date x1 a1 Col1 Col2 Col3
1 2004-11-29 1 2 1 0 1
2 2005-04-26 2 2 NA 1 0
3 2006-10-09 1 2 1 0 1
Thanks for any help regarding this in advance.
CodePudding user response:
library(data.table)
#convert NA to -999
DT[is.na(DT)] <- -999
#summarise, find maximum (if all original = NA, maximum will be -999)
ans <- DT[, lapply(.SD, max), by = .(Id, Date, x1, a1), .SDcols = patterns("^Col")]
#convert -999 back to NA
ans[ans == -999] <- NA
# Id Date x1 a1 Col1 Col2 Col3
# 1: 1 2004-11-29 1 2 1 0 1
# 2: 2 2005-04-26 2 2 NA 1 0
# 3: 3 2006-10-09 1 2 1 0 1
sample data used
DT <- fread("Id Date x1 a1 Col1 Col2 Col3
1 2004-11-29 1 2 0 NA 1
1 2004-11-29 1 2 1 0 0
2 2005-04-26 2 2 NA 1 0
3 2006-10-09 1 2 1 0 1
3 2006-10-09 1 2 0 0 NA")
CodePudding user response:
If you are open to a dplyr
, you could use
library(dplyr)
df %>%
group_by(Id, Date, x1, a1) %>%
summarise(across(Col1:Col3, ~na_if(max(coalesce(.x, -1)), -1)),
.groups = "drop")
This returns
# A tibble: 3 x 7
Id Date x1 a1 Col1 Col2 Col3
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2004-11-29 1 2 1 0 1
2 2 2005-04-26 2 2 NA 1 0
3 3 2006-10-09 1 2 1 0 1
The main idea here is to always select the max value per column and group. This is based on the assumption the values are either 0 or 1 or missing.