I have a data frame like this:
ID <- c("x", "y", "z", "ad", "tgfg", "gfgh", "asj", "gh", "sdf", "asdgz")
diagnosis <- c("yes", "no", "yes", "yes", "yes", "no", "yes", "no", "no", "no")
Q1 <- c("A",NA,"A",NA,NA,"C","D","A","B", NA)
Q2 <- c("D",NA,"D","C",NA,NA,"A","A","A","A")
Q3 <- c("B","B","C","A",NA,"A","B","D","E",NA)
Q4 <- c("B",NA,"C","C","C","C","D","B",NA,"A")
mydf <- data.frame(ID, diagnosis, Q1,Q2,Q3,Q4)
mydf
ID diagnosis Q1 Q2 Q3 Q4
x yes A D B B
y no <NA> <NA> B <NA>
z yes A D C C
ad yes <NA> C A C
tgfg yes <NA> <NA> <NA> C
gfgh no C <NA> A C
asj yes D A B D
gh no A A D B
sdf no B A E <NA>
asdgz no <NA> A <NA> A
I want to eliminate participants who did not answer more than 2 questions in the test(rows including more than 2 NAs) but only for the diagnosed group. I don't want to touch the undiagnosed group although they might have more than 2 NAs.
This is what I tried after my search on stackoverflow:
delete.na <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
new_df <- mydf %>% group_by(diagnosis== "yes") %>% delete.na(2) %>% ungroup()
new_df
ID diagnosis Q1 Q2 Q3 Q4 `diagnosis == "yes"`
x yes A D B B TRUE
z yes A D C C TRUE
ad yes NA C A C TRUE
gfgh no C NA A C FALSE
asj yes D A B D TRUE
gh no A A D B FALSE
sdf no B A E NA FALSE
asdgz no NA A NA A FALSE
As you can see, although participant "y" doesn't belong to "diagnosis == yes" group, it got deleted too. What should I do to prevent it?
Thanks a lot!
CodePudding user response:
Add an OR condition DF$diagnosis == "no"
to preserve them in the DF.
delete.na.but.no.diag <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n | DF$diagnosis == "no",]
}
new_df <- mydf %>% group_by(diagnosis== "yes") %>% delete.na.but.no.diag(2) %>% ungroup()
new_df
# ID diagnosis Q1 Q2 Q3 Q4 `diagnosis == "yes"`
# <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
# 1 x yes A D B B TRUE
# 2 y no NA NA B NA FALSE
# 3 z yes A D C C TRUE
# 4 ad yes NA C A C TRUE
# 5 gfgh no C NA A C FALSE
# 6 asj yes D A B D TRUE
# 7 gh no A A D B FALSE
# 8 sdf no B A E NA FALSE
# 9 asdgz no NA A NA A FALSE
CodePudding user response:
Here's another alternative
> mydf %>%
mutate(na.count = rowSums(is.na(mydf %>% select(starts_with("Q"))))) %>%
filter(diagnosis == "no" | na.count <= 2) %>%
select(-na.count)
ID diagnosis Q1 Q2 Q3 Q4
1 x yes A D B B
2 y no <NA> <NA> B <NA>
3 z yes A D C C
4 ad yes <NA> C A C
5 gfgh no C <NA> A C
6 asj yes D A B D
7 gh no A A D B
8 sdf no B A E <NA>
9 asdgz no <NA> A <NA> A
CodePudding user response:
In base R you could do:
subset(mydf, !(rowSums(is.na(mydf)) >2 & diagnosis == 'yes'))
ID diagnosis Q1 Q2 Q3 Q4
1 x yes A D B B
2 y no <NA> <NA> B <NA>
3 z yes A D C C
4 ad yes <NA> C A C
6 gfgh no C <NA> A C
7 asj yes D A B D
8 gh no A A D B
9 sdf no B A E <NA>
10 asdgz no <NA> A <NA> A
CodePudding user response:
You can use .
notation and put the rowSums
logic directly into filter
mydf %>%
filter(rowSums(is.na(.)) <= 2 | diagnosis == 'no')
#-----
ID diagnosis Q1 Q2 Q3 Q4
1 x yes A D B B
2 y no <NA> <NA> B <NA>
3 z yes A D C C
4 ad yes <NA> C A C
5 gfgh no C <NA> A C
6 asj yes D A B D
7 gh no A A D B
8 sdf no B A E <NA>
9 asdgz no <NA> A <NA> A