Home > other >  Conditionally remove duplicated rows by group
Conditionally remove duplicated rows by group

Time:10-26

I have a survey and my data looks something like this:

dt<-structure(list(ID = c("183577", "183577", "183907", "183907", 
"184188", "184188", "184188", "184188", "184188", "185167", "185167", 
"185167"), Question = c("7.6", "7.6", "7.7", "7.7", "1.1", "1.1", 
"1.2", "1.2", "10.1", "7.7", "7.7", "7.7"), Answer = c("PARTIALLY", 
"YES", "", "", "", "PARTIALLY", "YES", "", "", "", "YES", "PARTIALLY"
), Control = c(-2.93736019374946, 1.01801705406142, 0.0598708062395571, 
-0.456635810693228, 3.04311151438148, 0.641092485370467, 0.518503165474265, 
0.284680056109131, 1.98580865602238, -0.547063974950295, -0.507700003072695, 
-0.194028453167317)), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"), index = integer(0))

dt

        ID Question    Answer     Control
 1: 183577      7.6 PARTIALLY -2.93736019
 2: 183577      7.6       YES  1.01801705
 3: 183907      7.7            0.05987081
 4: 183907      7.7           -0.45663581
 5: 184188      1.1            3.04311151
 6: 184188      1.1 PARTIALLY  0.64109249
 7: 184188      1.2       YES  0.51850317
 8: 184188      1.2            0.28468006
 9: 184188     10.1            1.98580866
10: 185167      7.7           -0.54706397
11: 185167      7.7       YES -0.50770000
12: 185167      7.7 PARTIALLY -0.19402845

In the Answer variable, there are several missing values "". For some of these questions however, I have another row with the answer of the individual where the row takes a non missing value (e.g. PARTIALLY, YES, NO).

I would like to drop all of the duplicated rows where I have the answer that is a missing value (Answer==""), and another answer that takes a real value. Hence, for instance, dropping row 5 and keeping row 6. However, when none of the answers has a non-missing value (e.g. row 3 and 4 and row 9) I would like to keep the observations with missing values.

Does anyone know how I could achieve this?

the final dataset should look like this

       ID Question    Answer     Control
1: 183577      7.6 PARTIALLY -2.93736019
2: 183577      7.6       YES  1.01801705
3: 183907      7.7            0.05987081
4: 183907      7.7           -0.45663581
5: 184188      1.1 PARTIALLY  0.64109249
6: 184188      1.2       YES  0.51850317
7: 184188     10.1            1.98580866
8: 185167      7.7       YES -0.50770000
9: 185167      7.7 PARTIALLY -0.19402845

Please do note the special case of rows 1 and 2. Where I have two different non-missing values. as it is reflected in the final dataset also in this case I would like to keep both observations.

Thanks

CodePudding user response:

Try this, we filter based on one of two conditions being met. 1) Keep the row if all answers are blank 2) Keep the row if an answer is not blank

library(tidyverse)

dt %>% group_by(ID, Question) %>% 
  filter(all(Answer == "") | (Answer != ""))

# A tibble: 9 x 4
# Groups:   ID, Question [6]
  ID     Question Answer      Control
  <chr>  <chr>    <chr>         <dbl>
1 183577 7.6      "PARTIALLY" -2.94  
2 183577 7.6      "YES"        1.02  
3 183907 7.7      ""           0.0599
4 183907 7.7      ""          -0.457 
5 184188 1.1      "PARTIALLY"  0.641 
6 184188 1.2      "YES"        0.519 
7 184188 10.1     ""           1.99  
8 185167 7.7      "YES"       -0.508 
9 185167 7.7      "PARTIALLY" -0.194 
  • Related