How to check for the existence of a certain value in a set of variables only when there is no NA?

I have a data frame with hundreds of variables, grouped into different factors ("Happy_", "Sad_", etc.), and I want to create a set of new variables indicating whether a participant gave a rating of 4 on any of the variables in a factor. However, if any variable in that factor is NA, the new variable should also be NA.

I have tried the following, but it didn't work:

library(tidyverse)
df <- data.frame(Subj = c("A", "B", "C", "D"),
                 Happy_1_Num = c(4,2,2,NA),
                 Happy_2_Num = c(4,2,2,1),
                 Happy_3_Num = c(1,NA,2,4),
                 Sad_1_Num = c(2,1,4,3),
                 Sad_2_Num = c(NA,1,2,4),
                 Sad_3_Num = c(4,2,2,1))

# Doesn't work
df <- df %>% mutate(Happy_Any4 = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), NA,
                                                                 ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
                    Sad_Any4 = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), NA,
                                      ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))

I found a workaround: first generate a set of variables indicating whether each factor has any NA, then check whether the participant gave any rating of 4. It works, but since I have many factors, I was wondering if there is a more elegant way of doing it.

# workaround
df <- df %>% mutate(
  NA_Happy = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), 1,0),
  NA_Sad = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), 1,0))

df <- df %>% mutate(
  Happy_Any4 = ifelse(NA_Happy == 1, NA,
                        ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
  Sad_Any4 = ifelse(NA_Sad == 1, NA,
                        ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))

CodePudding user response：

Here is a base R option using split.default -

tmp <- df[-1]
cbind(df, sapply(split.default(tmp, sub('_.*', '', names(tmp))), 
                 function(x) as.integer(rowSums(x== 4) > 0)))

#  Subj Happy_1_Num Happy_2_Num Happy_3_Num Sad_1_Num Sad_2_Num Sad_3_Num Happy Sad
#1    A           4           4           1         2        NA         4     1  NA
#2    B           2           2          NA         1         1         2    NA   0
#3    C           2           2           2         4         2         2     0   1
#4    D          NA           1           4         3         4         1    NA   1

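sub() keeps only the "Happy" or "Sad" part of each column name, split.default() splits the columns into groups based on that prefix, and sapply() computes, for each group, whether a 4 is present in each row. Because rowSums() returns NA for any row that contains an NA, rows with a missing rating come out as NA.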
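To illustrate why the answer goes through rowSums() rather than any(), here is a minimal sketch on a toy vector (the names `flags`, `via_any`, `via_sum` and `via_rowsums` are made up for this example). any() returns TRUE as soon as it sees a TRUE, even if an NA is present, whereas a sum with an NA in it stays NA:

```r
# One participant's comparisons against 4: one NA, one miss, one hit.
flags <- c(NA, FALSE, TRUE)

via_any <- any(flags)        # TRUE: the NA is ignored once a TRUE is found
via_sum <- sum(flags) > 0    # NA: the NA propagates through the sum

# Same idea with rowSums() on a one-row logical matrix.
via_rowsums <- as.integer(rowSums(matrix(flags, nrow = 1)) > 0)  # NA
```

This is the behaviour the question asks for: a missing rating wins over a 4 elsewhere in the same factor.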

If you can afford to write each and every factor manually you can do -

library(dplyr)

df %>%
  mutate(Happy = as.integer(rowSums(select(., starts_with('Happy')) == 4) > 0), 
         Sad = as.integer(rowSums(select(., starts_with('Sad')) == 4) > 0))
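If writing out every factor by hand is not practical, one possible sketch is to derive the factor names from the column names and loop over them. This assumes every rating column follows the `<Factor>_<n>_Num` pattern from the question; `num_cols`, `prefixes` and `cols` are names chosen here for illustration, and the `_Any4` suffix is taken from the question:

```r
df <- data.frame(Subj = c("A", "B", "C", "D"),
                 Happy_1_Num = c(4, 2, 2, NA),
                 Happy_2_Num = c(4, 2, 2, 1),
                 Happy_3_Num = c(1, NA, 2, 4),
                 Sad_1_Num = c(2, 1, 4, 3),
                 Sad_2_Num = c(NA, 1, 2, 4),
                 Sad_3_Num = c(4, 2, 2, 1))

# Collect the factor prefixes ("Happy", "Sad", ...) from the rating columns.
num_cols <- grep("_Num$", names(df), value = TRUE)
prefixes <- unique(sub("_.*", "", num_cols))

for (p in prefixes) {
  # Rating columns belonging to this factor only.
  cols <- grep(paste0("^", p, "_.*_Num$"), names(df), value = TRUE)
  # rowSums() is NA for rows containing an NA, so the indicator is NA there too.
  df[[paste0(p, "_Any4")]] <- as.integer(rowSums(df[cols] == 4) > 0)
}
```

With the example data this should add Happy_Any4 and Sad_Any4 columns matching the Happy and Sad columns in the output shown above.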

CodePudding user response：

Here is another workaround that transposes the data frame and applies a function over its columns. I'm not sure it's more elegant, but here it is ^^

tmp <- cbind(sub("^((Happy)|(Sad))(_.*_Num)$", "\\1", colnames(df)), t(df))
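# Per participant: NA if any rating is missing, otherwise 1 if any rating is 4, else 0.
# t(df) is a character matrix, so trim any padding before comparing against "4".
Happy_Any4 <- apply(tmp[tmp[,1]== "Happy", -1], 2, 
                    function(x) ifelse(any(is.na(x)), NA, as.integer(any(trimws(x) == "4"))) )
Sad_Any4 <- apply(tmp[tmp[,1]== "Sad", -1], 2, 
                    function(x) ifelse(any(is.na(x)), NA, as.integer(any(trimws(x) == "4"))) )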

df <- cbind(df, Happy_Any4 = Happy_Any4, Sad_Any4 = Sad_Any4)

EDIT: The above was a clumsy first attempt; the version below is cleaner.

It works because sum() returns NA whenever any of its inputs is NA.

df <- df %>% mutate(Happy_Any4 = apply(df[,grep("^Happy_.*_Num$", colnames(df))], 
                                       1, function(x) 1*(sum(x == 4) > 0)),
                    Sad_Any4 = apply(df[, grep("^Sad_.*_Num$", colnames(df))], 
                                     1, function(x) 1*(sum(x == 4) > 0)))

apply() goes through every row, restricted to the columns whose names match the pattern (selected with grep()). It compares each value to 4, which gives a logical vector whose sum is the number of occurrences. Any NA in the row makes the sum NA. I then check whether the sum is above 0, and the 1* turns the resulting logical into a numeric.
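The per-row function can be tried on its own to see each case. In this small sketch, `any4_or_na` is a hypothetical name given to the anonymous function from the code above:

```r
# Hypothetical helper: same body as the anonymous function passed to apply().
any4_or_na <- function(x) 1 * (sum(x == 4) > 0)

r1 <- any4_or_na(c(4, 4, 1))    # two 4s, no NA -> 1
r2 <- any4_or_na(c(2, 2, NA))   # the NA makes the sum NA -> NA
r3 <- any4_or_na(c(NA, 1, 4))   # NA even though a 4 is present -> NA
r4 <- any4_or_na(c(2, 2, 2))    # no 4, no NA -> 0

is.numeric(r1)                  # TRUE: 1 * logical gives a numeric
```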