Create a column with values based on other columns in R

I want to create a variable that takes its value from 5 binary (Y/N) columns so that each "YES" counts as 1 and each "NO" as 0, so every observation will have a value for this new variable between 0 and 5.

The way that I am currently thinking about it is to create a new variable with all values at 0 and to run something like this for each of the determining variables:

new_variable <- ifelse(variable_x == "YES", new_variable + 1, new_variable)

but this seems a bit inelegant. Does anyone know if there is a better way to do this?

CodePudding user response：

In R, TRUE is treated as 1 and FALSE as 0 whenever a logical value is used in arithmetic. So if you want to count how many TRUE values there are in c(TRUE, FALSE, TRUE), you can just sum the vector:

sum(c(TRUE, FALSE, TRUE))

So in most cases, you really do not need to adjust anything. If you really, really wanted to do so, as.numeric does the job:

as.numeric(c(TRUE, FALSE, TRUE))

It pays to code your binary values as logical (TRUE/FALSE). In your case, you could do that on the fly using ==, as in

sum(c("YES", "NO", "YES", "YES", "NO") == "YES")

If there are any NA values in your data, you will have to decide how to treat them, and you will probably want to use sum's na.rm argument:

sum(c("YES", "YES", NA, "NO", "NO", NA) == "YES", na.rm = TRUE)

As I understand your question, you might consider using rowSums instead of sum.
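
A minimal sketch of the rowSums approach on a data frame. The example data, the column names col1 to col5, and the new column name n_yes are assumptions for illustration and are not taken from the question:

```r
# Hypothetical example data; column names are assumptions for illustration
df <- data.frame(
  col1 = c("YES", "NO",  "YES"),
  col2 = c("NO",  "NO",  "YES"),
  col3 = c("YES", "NO",  "YES"),
  col4 = c("NO",  "YES", "YES"),
  col5 = c("NO",  "NO",  "YES")
)

yes_cols <- paste0("col", 1:5)

# df[yes_cols] == "YES" gives a logical matrix;
# rowSums then counts the TRUE values in each row
df$n_yes <- rowSums(df[yes_cols] == "YES")

df$n_yes
```

Because the comparison is applied to all five columns at once, no loop and no per-column increment is needed.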

CodePudding user response：

I think this should suffice...

library(tidyverse)

# creating dataframe
df <- data.frame(col1 = sample(c("YES","NO"),10,replace = TRUE),
                 col2 = sample(c("YES","NO"),10,replace = TRUE),
                 col3 = sample(c("YES","NO"),10,replace = TRUE),
                 col4 = sample(c("YES","NO"),10,replace = TRUE),
                 col5 = sample(c("YES","NO"),10,replace = TRUE))


apply(df, 2, function(x) as.numeric(x == "YES")) %>% # converting YES/NO to 1/0
  as.data.frame() %>% # changing from matrix to dataframe
  mutate(sum = rowSums(across(where(is.numeric)), na.rm = TRUE)) # creating sum


#>    col1 col2 col3 col4 col5 sum
#> 1     1    0    0    1    0   2
#> 2     1    0    1    1    1   4
#> 3     0    0    0    0    1   1
#> 4     1    1    0    1    0   3
#> 5     0    0    1    0    0   1
#> 6     0    0    0    0    1   1
#> 7     0    1    0    1    0   2
#> 8     1    0    1    0    0   2
#> 9     1    1    0    1    0   3
#> 10    1    1    1    0    0   3

CodePudding user response：

You can also use rowSums() directly in a dplyr statement.

library(dplyr)

df %>% 
  mutate(x = rowSums(across(all_of(c("col1", "col2", "col3", "col4", "col5")), ~ .x == "YES")))

#      something col1 col2 col3 col4 col5 x
# 1  0.113703411   NO   NO  YES   NO   NO 1
# 2  0.622299405   NO   NO   NO  YES   NO 1
# 3  0.609274733   NO   NO  YES   NO   NO 1
# 4  0.623379442  YES   NO  YES   NO   NO 2
# 5  0.860915384   NO   NO  YES   NO  YES 2
# 6  0.640310605   NO   NO   NO  YES   NO 1
# 7  0.009495756   NO  YES  YES  YES  YES 4
# 8  0.232550506  YES   NO   NO  YES  YES 3
# 9  0.666083758   NO   NO   NO  YES   NO 1
# 10 0.514251141   NO   NO  YES   NO  YES 2

You can also use the tidyselect helpers depending on how your data frame is set up.

df %>% 
  mutate(x = rowSums(across(starts_with("col"), ~ .x == "YES")))

df %>% 
  mutate(x = rowSums(across(col1:col5, ~ .x == "YES")))

Regarding na.rm = TRUE: I would not use it here. How to handle NAs deserves more careful thought. One should not simply assume that an NA is equal to a "NO" without knowing the problem better. If an NA really does mean "NO", then I would replace that value explicitly at some stage in the workflow instead.
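
As a small sketch of that point (the data and the object names are made up for illustration): without na.rm, a row that contains an NA gets NA as its count, which keeps the missing answer visible. If an NA is known to mean "NO", it can be recoded explicitly before counting:

```r
# Hypothetical data with one missing answer in the second row
df_na <- data.frame(
  col1 = c("YES", NA,    "NO"),
  col2 = c("YES", "YES", "NO"),
  col3 = c("NO",  "YES", "NO")
)

# Default behaviour: the NA propagates, so the count for row 2 is NA
counts_default <- rowSums(df_na == "YES")
counts_default

# Only if NA is known to mean "NO": recode explicitly, then count
df_recoded <- df_na
df_recoded[is.na(df_recoded)] <- "NO"
counts_recoded <- rowSums(df_recoded == "YES")
counts_recoded
```

Recoding in a separate, visible step documents the decision, whereas na.rm = TRUE makes the same assumption silently.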

Data:

set.seed(1234)

df <- data.frame(something = runif(10),
                 col1 = sample(c("YES","NO"),10,replace = TRUE),
                 col2 = sample(c("YES","NO"),10,replace = TRUE),
                 col3 = sample(c("YES","NO"),10,replace = TRUE),
                 col4 = sample(c("YES","NO"),10,replace = TRUE),
                 col5 = sample(c("YES","NO"),10,replace = TRUE))