I have a survey and my data looks something like this:
dt<-structure(list(ID = c("183577", "183577", "183907", "183907",
"184188", "184188", "184188", "184188", "184188", "185167", "185167",
"185167"), Question = c("7.6", "7.6", "7.7", "7.7", "1.1", "1.1",
"1.2", "1.2", "10.1", "7.7", "7.7", "7.7"), Answer = c("PARTIALLY",
"YES", "", "", "", "PARTIALLY", "YES", "", "", "", "YES", "PARTIALLY"
), Control = c(-2.93736019374946, 1.01801705406142, 0.0598708062395571,
-0.456635810693228, 3.04311151438148, 0.641092485370467, 0.518503165474265,
0.284680056109131, 1.98580865602238, -0.547063974950295, -0.507700003072695,
-0.194028453167317)), row.names = c(NA, -12L), class = c("data.table",
"data.frame"), index = integer(0))
dt
ID Question Answer Control
1: 183577 7.6 PARTIALLY -2.93736019
2: 183577 7.6 YES 1.01801705
3: 183907 7.7 0.05987081
4: 183907 7.7 -0.45663581
5: 184188 1.1 3.04311151
6: 184188 1.1 PARTIALLY 0.64109249
7: 184188 1.2 YES 0.51850317
8: 184188 1.2 0.28468006
9: 184188 10.1 1.98580866
10: 185167 7.7 -0.54706397
11: 185167 7.7 YES -0.50770000
12: 185167 7.7 PARTIALLY -0.19402845
In the Answer variable, there are several missing values ""
. For some of these questions however, I have another row with the answer of the individual where the row takes a non missing value (e.g. PARTIALLY, YES, NO).
I would like to drop all of the duplicated rows where I have the answer that is a missing value (Answer==""
), and another answer that takes a real value. Hence, for instance, dropping row 5 and keeping row 6. However, when none of the answers has a non-missing value (e.g. row 3 and 4 and row 9) I would like to keep the observations with missing values.
Does anyone know how I could achieve this?
the final dataset should look like this
ID Question Answer Control
1: 183577 7.6 PARTIALLY -2.93736019
2: 183577 7.6 YES 1.01801705
3: 183907 7.7 0.05987081
4: 183907 7.7 -0.45663581
5: 184188 1.1 PARTIALLY 0.64109249
6: 184188 1.2 YES 0.51850317
7: 184188 10.1 1.98580866
8: 185167 7.7 YES -0.50770000
9: 185167 7.7 PARTIALLY -0.19402845
Please do note the special case of rows 1 and 2. Where I have two different non-missing values. as it is reflected in the final dataset also in this case I would like to keep both observations.
Thanks
CodePudding user response:
Try this, we filter based on one of two conditions being met. 1) Keep the row if all answers are blank 2) Keep the row if an answer is not blank
library(tidyverse)
dt %>% group_by(ID, Question) %>%
filter(all(Answer == "") | (Answer != ""))
# A tibble: 9 x 4
# Groups: ID, Question [6]
ID Question Answer Control
<chr> <chr> <chr> <dbl>
1 183577 7.6 "PARTIALLY" -2.94
2 183577 7.6 "YES" 1.02
3 183907 7.7 "" 0.0599
4 183907 7.7 "" -0.457
5 184188 1.1 "PARTIALLY" 0.641
6 184188 1.2 "YES" 0.519
7 184188 10.1 "" 1.99
8 185167 7.7 "YES" -0.508
9 185167 7.7 "PARTIALLY" -0.194