Home > OS >  Delete rows based on specific amount of NA and for a specific group in R
Delete rows based on specific amount of NA and for a specific group in R

Time:10-20

I have a data frame like this:

ID <- c("x", "y", "z", "ad", "tgfg", "gfgh", "asj", "gh", "sdf", "asdgz")
diagnosis <- c("yes", "no", "yes", "yes", "yes", "no", "yes", "no", "no", "no")
Q1 <- c("A",NA,"A",NA,NA,"C","D","A","B", NA)
Q2 <- c("D",NA,"D","C",NA,NA,"A","A","A","A")
Q3 <- c("B","B","C","A",NA,"A","B","D","E",NA)
Q4 <- c("B",NA,"C","C","C","C","D","B",NA,"A")
mydf <- data.frame(ID, diagnosis, Q1,Q2,Q3,Q4)
mydf

ID diagnosis   Q1   Q2   Q3   Q4
x       yes    A    D    B    B
y        no <NA> <NA>    B <NA>
z       yes    A    D    C    C
ad      yes <NA>    C    A    C
tgfg    yes <NA> <NA> <NA>    C
gfgh     no    C <NA>    A    C
asj     yes    D    A    B    D
gh       no    A    A    D    B
sdf      no    B    A    E <NA>
asdgz    no <NA>    A <NA>    A

I want to eliminate participants who did not answer more than 2 questions in the test(rows including more than 2 NAs) but only for the diagnosed group. I don't want to touch the undiagnosed group although they might have more than 2 NAs.

This is what I tried after my search on stackoverflow:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}
new_df <- mydf %>% group_by(diagnosis== "yes") %>% delete.na(2) %>% ungroup()
new_df

ID    diagnosis Q1    Q2    Q3    Q4    `diagnosis == "yes"`              
x     yes       A     D     B     B     TRUE                
z     yes       A     D     C     C     TRUE                
ad    yes       NA    C     A     C     TRUE                
gfgh  no        C     NA    A     C     FALSE               
asj   yes       D     A     B     D     TRUE                
gh    no        A     A     D     B     FALSE               
sdf   no        B     A     E     NA    FALSE               
asdgz no        NA    A     NA    A     FALSE

As you can see, although participant "y" doesn't belong to "diagnosis == yes" group, it got deleted too. What should I do to prevent it?

Thanks a lot!

CodePudding user response:

Add an OR condition DF$diagnosis == "no" to preserve them in the DF.

delete.na.but.no.diag <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n | DF$diagnosis == "no",]
}
new_df <- mydf %>% group_by(diagnosis== "yes") %>% delete.na.but.no.diag(2) %>% ungroup()
new_df

#   ID    diagnosis Q1    Q2    Q3    Q4    `diagnosis == "yes"`
#   <chr> <chr>     <chr> <chr> <chr> <chr> <lgl>               
# 1 x     yes       A     D     B     B     TRUE                
# 2 y     no        NA    NA    B     NA    FALSE               
# 3 z     yes       A     D     C     C     TRUE                
# 4 ad    yes       NA    C     A     C     TRUE                
# 5 gfgh  no        C     NA    A     C     FALSE               
# 6 asj   yes       D     A     B     D     TRUE                
# 7 gh    no        A     A     D     B     FALSE               
# 8 sdf   no        B     A     E     NA    FALSE               
# 9 asdgz no        NA    A     NA    A     FALSE       

CodePudding user response:

Here's another alternative

> mydf %>% 
    mutate(na.count = rowSums(is.na(mydf %>% select(starts_with("Q"))))) %>% 
    filter(diagnosis == "no" | na.count <= 2) %>% 
    select(-na.count)

     ID diagnosis   Q1   Q2   Q3   Q4
1     x       yes    A    D    B    B
2     y        no <NA> <NA>    B <NA>
3     z       yes    A    D    C    C
4    ad       yes <NA>    C    A    C
5  gfgh        no    C <NA>    A    C
6   asj       yes    D    A    B    D
7    gh        no    A    A    D    B
8   sdf        no    B    A    E <NA>
9 asdgz        no <NA>    A <NA>    A

CodePudding user response:

In base R you could do:

subset(mydf, !(rowSums(is.na(mydf)) >2 & diagnosis == 'yes'))

      ID diagnosis   Q1   Q2   Q3   Q4
1      x       yes    A    D    B    B
2      y        no <NA> <NA>    B <NA>
3      z       yes    A    D    C    C
4     ad       yes <NA>    C    A    C
6   gfgh        no    C <NA>    A    C
7    asj       yes    D    A    B    D
8     gh        no    A    A    D    B
9    sdf        no    B    A    E <NA>
10 asdgz        no <NA>    A <NA>    A

CodePudding user response:

You can use . notation and put the rowSums logic directly into filter

mydf %>%
  filter(rowSums(is.na(.)) <= 2 | diagnosis == 'no')

#-----
     ID diagnosis   Q1   Q2   Q3   Q4
1     x       yes    A    D    B    B
2     y        no <NA> <NA>    B <NA>
3     z       yes    A    D    C    C
4    ad       yes <NA>    C    A    C
5  gfgh        no    C <NA>    A    C
6   asj       yes    D    A    B    D
7    gh        no    A    A    D    B
8   sdf        no    B    A    E <NA>
9 asdgz        no <NA>    A <NA>    A
  • Related