Select columns on match with vector and create ifelse condition with their content

I have a dataset covering several diseases, where 0 indicates that a person does not have the disease and 1 that they do.

To illustrate with an example: I am interested in Disease A and in whether the people in the dataset have this disease on its own or as a consequence of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":

SecondaryCauses = c("DiseaseB", "DiseaseD")

"NotDiseasedWithA" means that they do not have disease A. "Primary" means that they have disease A but none of the known diseases that can cause it. "Secondary" means that they have disease A and at least one disease that probably caused it.

Sample data

ID  DiseaseA    DiseaseB    DiseaseC    DiseaseD    DiseaseE
1   0           1           0           0           0
2   1           0           0           0           1
3   1           0           1           1           0
4   1           0           1           1           1
5   0           0           0           0           0

My questions are:

How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?

I tried something like the following, but this did not work:

DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))

In the end I want to have this result:

ID  DiseaseA    DiseaseB    DiseaseC    DiseaseD    DiseaseE    Type
1   0           1           0           0           0           NotDiseasedWithA
2   1           0           0           0           1           Primary
3   1           0           1           1           0           Secondary
4   1           0           1           1           1           Secondary
5   0           0           0           0           0           NotDiseasedWithA

CodePudding user response：

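Use rowSums instead of sum, and use df[, SecondaryCauses] to select the columns named in SecondaryCauses. sum(names(DF) %in% SecondaryCauses) only counts how many column names match the vector, so it returns the same number for every row, whereas rowSums adds up the values in those columns row by row.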

library(tidyverse)

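# Rows without disease A are labelled first; the remaining rows are "Secondary"
# if any column named in SecondaryCauses is 1, otherwise "Primary".
df %>% mutate(Type = ifelse(DiseaseA == 0,
                            "NotDiseasedWithA",
                            ifelse(DiseaseA == 1 & rowSums(df[, SecondaryCauses]) > 0,
                                   "Secondary",
                                   "Primary")))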

Output

  ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE             Type
1  1        0        1        0        0        0 NotDiseasedWithA
2  2        1        0        0        0        1          Primary
3  3        1        0        1        1        0        Secondary
4  4        1        0        1        1        1        Secondary
5  5        0        0        0        0        0 NotDiseasedWithA
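
To see why the original attempt cannot work, it may help to compare the two expressions on the sample data. The sketch below uses base R only and rebuilds the sample table by hand; the names name_matches and cause_counts are made up for illustration.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  ID       = 1:5,
  DiseaseA = c(0, 1, 1, 1, 0),
  DiseaseB = c(1, 0, 0, 0, 0),
  DiseaseC = c(0, 0, 1, 1, 0),
  DiseaseD = c(0, 0, 1, 1, 0),
  DiseaseE = c(0, 1, 0, 1, 0)
)
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# The original condition counts matching column NAMES:
# a single number for the whole table, identical for every row
name_matches <- sum(names(df) %in% SecondaryCauses)
name_matches   # 2

# rowSums adds up the VALUES in the selected columns, one number per row
cause_counts <- rowSums(df[, SecondaryCauses])
unname(cause_counts)   # 1 0 1 1 0
```

Because name_matches is always greater than 0 here, the original condition marks every row with disease A as "Secondary". One caveat: if SecondaryCauses names only a single column, df[, SecondaryCauses] collapses to a plain vector and rowSums fails; df[, SecondaryCauses, drop = FALSE] avoids that.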

CodePudding user response：

Using data.table:

df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L, 
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L, 
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA, 
-5L), class = c("data.frame"))

library(data.table)

setDT(df) # make it a data.table

SecondaryCauses = c("DiseaseB", "DiseaseD")

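# Rows without disease A get a fixed label
df[DiseaseA == 0, Type := "NotDiseasedWithA"]

# Rows with disease A: .SD holds only the columns listed in .SDcols
df[DiseaseA == 1,
   Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"),
   .SDcols = SecondaryCauses]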

df

#    ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE             Type
# 1:  1        0        1        0        0        0 NotDiseasedWithA
# 2:  2        1        0        0        0        1          Primary
# 3:  3        1        0        1        1        0        Secondary
# 4:  4        1        0        1        1        1        Secondary
# 5:  5        0        0        0        0        0 NotDiseasedWithA
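
For comparison, the same logic can probably be written without any package. This is a minimal base R sketch on the sample data, not taken from either answer above; has_cause is a helper name introduced here, and the comparison with == 1 assumes the disease columns only hold 0 and 1.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  ID       = 1:5,
  DiseaseA = c(0, 1, 1, 1, 0),
  DiseaseB = c(1, 0, 0, 0, 0),
  DiseaseC = c(0, 0, 1, 1, 0),
  DiseaseD = c(0, 0, 1, 1, 0),
  DiseaseE = c(0, 1, 0, 1, 0)
)
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# TRUE for rows where at least one secondary-cause column equals 1;
# drop = FALSE keeps a data frame even if the vector names a single column
has_cause <- rowSums(df[, SecondaryCauses, drop = FALSE] == 1) > 0

# Same nested ifelse as in the answers, using the precomputed flag
df$Type <- ifelse(df$DiseaseA == 0,
                  "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))

df
```

Selecting by name through the SecondaryCauses vector means the position and order of the disease columns do not matter, which fits the case of more than 20 unordered columns.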