Creating a new column based on values from other columns, using mutate() and case_when()

I am a student relatively new to R and have learnt a lot from browsing here. I have recently been stuck on something that, after hours of trying, I still haven't been able to figure out. Consider the following data set:

ID Y1 Y2 Y3 Y4
1  0  0  1  1
2  0  0  0  0
3  NA NA NA NA

I want to create a new column that is filled according to the following conditions:

If the row contains 1, return 1 regardless of NA or 0
If it contains a mix of 0 and NA but not 1, return 0
If it only contains NA, return NA

So using the example above I wanted to get the following:

ID Y1 Y2 Y3 Y4 Outcome
1  0  0  1  1  1
2  0  0  0  0  0
3  NA NA NA NA NA

However, the code I tried:

Data2 <- Data %>% mutate(Outcome = case_when( 
                                Data$Y1 == "na" &
                                Data$Y2 == "na" &
                                Data$Y3 == "na" &
                                Data$Y4 == "na" ~ "na"))  %>%                                
          mutate(Outcome = case_when(Data$Y1 == 1 ~ "1", 
                                 Data$Y2 == 1 ~ "1", 
                                 Data$Y3 == 1 ~ "1",
                                 Data$Y4 == 1 ~ "1",
                                 TRUE ~ "No"))

returns:

ID Y1 Y2 Y3 Y4 Outcome
1  0  0  1  1  1
2  0  0  0  0  0
3  NA NA NA NA 0

which seems to ignore condition 3: if the row contains only NA, return NA.

Any pointers as to what I have done wrong would be greatly appreciated.

Please forgive the formatting; I'm not sure how to make it prettier, as this is the first time I have asked a question here.

Many thanks in advance!

[Edit] Thanks to Shah, I noticed that there is potential for confusion, for which I apologise. To clarify: this is just a segment of the data set, shown to get the point across. I'm dealing with a big data set that contains more columns, some of which also hold numeric values.

User response:

Checking each column (Y1, Y2, Y3, etc.) individually is tedious and does not scale. It becomes a real problem if you have 100 columns that need this check.

As shown in your example, you want to ignore the first column (ID) and include all other columns in the calculation. You can do that as follows; the -1 in df[-1] drops the first column, ID.

Also, use is.na() to test for NA values. A comparison such as x == "na" or x == NA does not return TRUE for a missing value; it returns NA.
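To illustrate the point about is.na(), here is a minimal sketch on a small made-up vector (the name x and its values are only for illustration, not part of the original data):

```r
# Hypothetical example vector with one missing value
x <- c(0, 1, NA)

# Comparing with == propagates NA instead of returning TRUE for missing values
eq_string <- x == "na"   # FALSE FALSE NA
eq_na     <- x == NA     # NA NA NA

# is.na() is the reliable test for missing values
missing <- is.na(x)      # FALSE FALSE TRUE
```

This is likely why the first case_when() in the question never matches: the condition evaluates to NA for the all-NA row, and case_when() treats NA conditions as not matched.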

# Count the non-NA values in each row; rows with a count of 0
# are set to NA in Outcome below
non_NA <- rowSums(!is.na(df[-1]))
# Assign 1 if the row sum is greater than 0 (i.e. the row contains a 1), else 0
df$Outcome <- as.integer(rowSums(df[-1], na.rm = TRUE) > 0)
# Set Outcome to NA for rows in which every value is NA
df$Outcome[non_NA == 0] <- NA
df
#  ID Y1 Y2 Y3 Y4 Outcome
#1  1  0  0  1  1       1
#2  2  0  0  0  0       0
#3  3 NA NA NA NA      NA

data

df <- structure(list(ID = 1:3, Y1 = c(0L, 0L, NA), Y2 = c(0L, 0L, NA
), Y3 = c(1L, 0L, NA), Y4 = c(1L, 0L, NA)), 
class = "data.frame", row.names = c(NA, -3L))
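The question's edit mentions that the real data set has further columns, some of them numeric, so summing everything except ID may not be safe there. Below is a hedged variant of the same idea that selects the indicator columns by name and tests explicitly for 1. The data frame df2, the Age column, and the assumption that the indicator columns are named Y followed by digits are all hypothetical; adjust the pattern to the real column names.

```r
# Hypothetical data: the Y columns plus an unrelated numeric column (Age)
df2 <- data.frame(
  ID  = 1:4,
  Y1  = c(0L, 0L, NA, NA),
  Y2  = c(0L, 0L, NA, 0L),
  Y3  = c(1L, 0L, NA, NA),
  Y4  = c(1L, 0L, NA, 0L),
  Age = c(34, 51, 27, 45)
)

# Assumption: indicator columns are named "Y" followed by digits
y_cols <- grep("^Y[0-9]+$", names(df2), value = TRUE)
y <- df2[y_cols]

# TRUE where the row contains at least one 1 (NAs are ignored)
has_one <- rowSums(y == 1, na.rm = TRUE) > 0
# TRUE where every selected column is NA
all_na <- rowSums(!is.na(y)) == 0

# NA for all-NA rows, otherwise 1 if any 1 was found, else 0
df2$Outcome <- ifelse(all_na, NA_integer_, as.integer(has_one))
```

With this sample, the fourth row (a mix of 0 and NA) gets 0, and the Age column plays no part in the result.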

User response:

You can try this using dplyr's rowwise() function, which treats each row separately:

library(dplyr)

df |> rowwise() |>
  mutate(Outcome = case_when(
    any(c_across(Y1:Y4) == 1) ~ "1",
    all(is.na(c_across(Y1:Y4))) ~ NA_character_,
    TRUE ~ "0"))

output

# A tibble: 3 × 6
# Rowwise: 
     ID    Y1    Y2    Y3    Y4 Outcome
  <int> <int> <int> <int> <int> <chr>  
1     1     0     0     1     1 1      
2     2     0     0     0     0 0      
3     3    NA    NA    NA    NA NA
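Because rowwise() evaluates each row separately, it may be slow on a large data set. As a possible vectorised alternative, here is a minimal base R sketch. It assumes the Y columns contain only 0, 1, or NA, in which case the row-wise maximum happens to match the three rules; it would not be correct if those columns held other values. Unlike the dplyr version above, it returns an integer column rather than a character one.

```r
# Same sample data as in the answers above
df <- data.frame(
  ID = 1:3,
  Y1 = c(0L, 0L, NA),
  Y2 = c(0L, 0L, NA),
  Y3 = c(1L, 0L, NA),
  Y4 = c(1L, 0L, NA)
)

y_cols <- c("Y1", "Y2", "Y3", "Y4")

# pmax() takes the element-wise maximum across the columns; with na.rm = TRUE
# it skips NAs and returns NA only where every column is NA
df$Outcome <- do.call(pmax, c(as.list(df[y_cols]), list(na.rm = TRUE)))
```

For the sample data this gives 1, 0 and NA for the three rows, in that order.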