Home > OS >  r column replace by group
r column replace by group

Time:04-14

I am dealing with a dataset with duplicate rows like this

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      0       NA       1
1    2004-11-29  1     2      1       0        0
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1
3    2006-10-09  1     2      0       0        NA

What I like to do is, among the two rows for the same ID , same date, if the value in Col1, Col2, Col3 is 1 then replace the value with 1, else its a 0 or NA if both values are missing in that column.

For example, ID 1 two rows with same date,

  • Col1 0, 1 - replace the value to 1,
  • Col2 NA,0 - replace the value to 0
  • Col3 1, 0 - replace the value to 1

Two rows with

  • 0,1 gets replaced with 1,
  • NA, NA is replaced with NA
  • NA, 0 is replaced with 0
  • NA, 1 is replaced with 1
  • 1, 1 is replaced with 1
  • 0, 0 is replaced with 0

So on

Expecting a dataset like this

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      1       0       1

2    2005-04-26  2     2      NA      1        0

3    2006-10-09  1     2      1       0        1

Thanks for any help regarding this in advance.

CodePudding user response:

library(data.table)

#convert NA to -999
DT[is.na(DT)] <- -999
#summarise, find maximum (if all original = NA, maximum will be -999)
ans <- DT[, lapply(.SD, max), by = .(Id, Date, x1, a1), .SDcols = patterns("^Col")]
#convert -999 back to NA
ans[ans == -999] <- NA
#    Id       Date x1 a1 Col1 Col2 Col3
# 1:  1 2004-11-29  1  2    1    0    1
# 2:  2 2005-04-26  2  2   NA    1    0
# 3:  3 2006-10-09  1  2    1    0    1

sample data used

DT <- fread("Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      0       NA       1
1    2004-11-29  1     2      1       0        0
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1
3    2006-10-09  1     2      0       0        NA")

CodePudding user response:

If you are open to a dplyr, you could use

library(dplyr)
df %>% 
  group_by(Id, Date, x1, a1) %>% 
  summarise(across(Col1:Col3, ~na_if(max(coalesce(.x, -1)), -1)),
            .groups = "drop")

This returns

# A tibble: 3 x 7
     Id Date          x1    a1  Col1  Col2  Col3
  <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 2004-11-29     1     2     1     0     1
2     2 2005-04-26     2     2    NA     1     0
3     3 2006-10-09     1     2     1     0     1

The main idea here is to always select the max value per column and group. This is based on the assumption the values are either 0 or 1 or missing.

  • Related