Home > Blockchain >  dealing with missing data in row-wise manipulation in r
dealing with missing data in row-wise manipulation in r

Time:06-10

ID<-1:6
math<-c("YES","NO","YES","NO",NA,NA)
history<-c(NA,NA,"NO","NO","YES",NA)

dt<-data.frame(ID,math,history)

  ID math history
1  1  YES    <NA>
2  2   NO    <NA>
3  3  YES      NO
4  4   NO      NO
5  5 <NA>     YES
6  6 <NA>    <NA>

I want to make an additional column ("pass") as follows

  1. if a student ever had "yes" at least once: "YES" (No matter if another subject is missing data or not or so.)

  2. if the student did not get "yes"

2-1) If both subjects are missing data : NA

2-2) if one of subjects is "NO" : "NO"

so, the columns would like this: (I can do this manually with this minimal example. but not with my real data)

> dt
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>


I tried to use

dt$pass <- ifelse(rowSums(dt[,-1]=="YES",na.rm=T)>0,"YES","NO")

this code, but it was tricky because if I put na.rm=T they consider NA is "NO" (ID 6 student will be "NO")

if I put na.rm=F, an only students that have both subject's data are considered.

In my data, I have really lots of columns, not only math and history..

CodePudding user response:

library(tidyverse)

ID <- 1:6
math <- c("YES", "NO", "YES", "NO", NA, NA)
history <- c(NA, NA, "NO", "NO", "YES", NA)

dt <- data.frame(ID, math, history)

dt |> 
  rowwise() |> 
  mutate(pass = case_when(
    sum(c_across(-ID) == "YES", na.rm = TRUE) >= 1 ~ "YES",
    sum(c_across(-ID) == "NO", na.rm = TRUE) >= 1  ~ "No",
    TRUE                                           ~ NA_character_
  ))
#> # A tibble: 6 × 4
#> # Rowwise: 
#>      ID math  history pass 
#>   <int> <chr> <chr>   <chr>
#> 1     1 YES   <NA>    YES  
#> 2     2 NO    <NA>    No   
#> 3     3 YES   NO      YES  
#> 4     4 NO    NO      No   
#> 5     5 <NA>  YES     YES  
#> 6     6 <NA>  <NA>    <NA>

Created on 2022-06-10 by the reprex package (v2.0.1)

CodePudding user response:

Seems like the order is what matters:

dt$pass <- NA
dt$pass[dt$math == "NO" | dt$history=="NO"] <- "NO"
dt$pass[dt$math == "YES" | dt$history=="YES"] <- "YES"

> dt
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>

CodePudding user response:

Try this

library(dplyr)
dt %>% mutate(pass = case_when(math == "YES" | history == "YES" ~ "YES" ,
 is.na(math) & is.na(history) ~ NA_character_ ,
 TRUE ~ "NO"))

CodePudding user response:

A tidyverse solution that is robust with respect to the number (and names) of subjects:

library(tidyverse)

dt %>% 
  mutate(
    pass=dt %>% 
           pivot_longer(-ID) %>% 
           group_by(ID) %>% 
           summarise(
             anyYes=sum(value == "YES", na.rm=T), 
             anyNo=sum(value == "NO", na.rm=T)
           ) %>% 
           mutate(
             pass=ifelse(
                    anyYes >= 1, 
                    "YES", 
                    ifelse(anyNo >= 1, "NO", NA)
                   )
           ) %>% 
           pull(pass)
  )
  ID math history pass
1  1  YES    <NA>  YES
2  2   NO    <NA>   NO
3  3  YES      NO  YES
4  4   NO      NO   NO
5  5 <NA>     YES  YES
6  6 <NA>    <NA> <NA>

The key is the pivoting to long format.

  • Related