I want to create a variable that takes its value from 5 binary (Y/N) columns so that each "YES" counts as 1 and each "NO" as 0, so every observation will have a value for this new variable between 0 and 5.
The way that I am currently thinking about it is to create a new variable with all values at 0 and to run something like this for each of the determining variables:
new_variable <- if (variable_x == "YES"){increment(new_variable)}
but this seems a bit inelegant. Does anyone know if there is a better way to do this?
CodePudding user response:
In R, TRUE is treated as 1 and FALSE as 0 whenever a logical vector is used in arithmetic. So if you want to count how many TRUE values there are in c(TRUE, FALSE, TRUE), you can just sum that:
sum(c(TRUE, FALSE, TRUE))
So in most cases, you really do not need to convert anything. If you really, really wanted to do so, as.numeric does the job:
as.numeric(c(TRUE, FALSE, TRUE))
It pays to code your binary values as logical (TRUE/FALSE). In your case, you could do that on the fly using ==, as in
sum(c("YES", "NO", "YES", "YES", "NO") == "YES")
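For a whole data frame, the same idea can be applied once up front. This is a minimal sketch, assuming a made-up data frame called answers with columns q1 to q3 (these names are illustrative and not taken from the question):

```r
# Hypothetical example data; the names are assumptions, not from the question
answers <- data.frame(
  q1 = c("YES", "NO", "YES"),
  q2 = c("NO", "NO", "YES"),
  q3 = c("YES", "NO", "YES")
)

# Recode every column from "YES"/"NO" to TRUE/FALSE, keeping the data frame shape
answers[] <- lapply(answers, function(x) x == "YES")

# Each column is now logical, so sum() counts its TRUE values directly
yes_per_column <- sapply(answers, sum)
```

After the recoding step, no further comparison against "YES" is needed anywhere downstream.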
If there are any NA values in your data, you will have to decide how to treat them; you will probably want to use sum's na.rm argument:
sum(c("YES", "YES", NA, "NO", "NO", NA) == "YES", na.rm = TRUE)
As I understand your question, you want one count per row of a data frame, so you might consider using rowSums instead of sum.
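A row-wise version could look like the sketch below. It assumes a small made-up data frame named survey whose five answer columns are called col1 to col5; adjust the names to your own data.

```r
# Hypothetical example data; `survey` and its column names are assumptions
survey <- data.frame(
  id   = 1:3,
  col1 = c("YES", "NO", "YES"),
  col2 = c("YES", "NO", "NO"),
  col3 = c("YES", "NO", "YES"),
  col4 = c("YES", "NO", "NO"),
  col5 = c("YES", "NO", "YES")
)

# Names of the five YES/NO columns to count over
yes_cols <- paste0("col", 1:5)

# Comparing the selected columns to "YES" gives a logical matrix;
# rowSums() then counts the TRUE values in each row
survey$n_yes <- rowSums(survey[yes_cols] == "YES")
```

Restricting the comparison to yes_cols keeps unrelated columns such as id out of the count.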
CodePudding user response:
I think this should suffice...
library(tidyverse)
# creating dataframe (no seed is set, so the sampled values differ between runs)
df <- data.frame(col1 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col2 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col3 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col4 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col5 = sample(c("YES", "NO"), 10, replace = TRUE))
apply(df, 2, function(x) as.numeric(x == "YES")) %>% # converting YES/NO to 1/0
  as.data.frame() %>% # changing from matrix to data frame
  mutate(sum = rowSums(across(where(is.numeric)), na.rm = TRUE)) # creating sum
#>    col1 col2 col3 col4 col5 sum
#> 1     1    0    0    1    0   2
#> 2     1    0    1    1    1   4
#> 3     0    0    0    0    1   1
#> 4     1    1    0    1    0   3
#> 5     0    0    1    0    0   1
#> 6     0    0    0    0    1   1
#> 7     0    1    0    1    0   2
#> 8     1    0    1    0    0   2
#> 9     1    1    0    1    0   3
#> 10    1    1    1    0    0   3
CodePudding user response:
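You can also use rowSums() directly in a dplyr statement.
library(dplyr)
# a lambda is used here because passing extra arguments through
# across()'s ... is deprecated since dplyr 1.1.0
df %>%
  mutate(x = rowSums(across(all_of(c("col1", "col2", "col3", "col4", "col5")), ~ .x == "YES")))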
#      something col1 col2 col3 col4 col5 x
# 1  0.113703411   NO   NO  YES   NO   NO 1
# 2  0.622299405   NO   NO   NO  YES   NO 1
# 3  0.609274733   NO   NO  YES   NO   NO 1
# 4  0.623379442  YES   NO  YES   NO   NO 2
# 5  0.860915384   NO   NO  YES   NO  YES 2
# 6  0.640310605   NO   NO   NO  YES   NO 1
# 7  0.009495756   NO  YES  YES  YES  YES 4
# 8  0.232550506  YES   NO   NO  YES  YES 3
# 9  0.666083758   NO   NO   NO  YES   NO 1
# 10 0.514251141   NO   NO  YES   NO  YES 2
You can also use the tidyselect helpers depending on how your data frame is set up.
df %>%
  mutate(x = rowSums(across(starts_with("col"), ~ .x == "YES")))
df %>%
  mutate(x = rowSums(across(col1:col5, ~ .x == "YES")))
Regarding na.rm = TRUE: I would not use it by default. How to handle NAs requires more careful thought. One should not simply assume that an NA is equal to a "NO" without knowing the problem better. If an NA really does mean "NO", then I would replace that value explicitly at some stage in the workflow instead.
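To illustrate the difference, here is a small sketch on made-up data (the data frame survey_na and its values are assumptions for illustration only). It compares the default behaviour, na.rm = TRUE, and an explicit recode:

```r
# Hypothetical example data with some missing answers
survey_na <- data.frame(
  col1 = c("YES", NA,   "NO"),
  col2 = c("YES", "NO", "NO"),
  col3 = c(NA,    "NO", "YES")
)

# Logical matrix; NA answers stay NA after the comparison
is_yes <- survey_na == "YES"

# Default: any row with a missing answer gets an NA count
count_default <- rowSums(is_yes)

# na.rm = TRUE: missing answers are silently treated as "not YES"
count_na_rm <- rowSums(is_yes, na.rm = TRUE)

# Explicit alternative: recode NA first, but only if NA really means "NO"
recoded <- survey_na
recoded[is.na(recoded)] <- "NO"
count_recoded <- rowSums(recoded == "YES")
```

The last two approaches give the same numbers here; the difference is that the explicit recode makes the assumption visible in the workflow, while the default keeps rows with missing answers flagged as NA.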
Data:
set.seed(1234)
df <- data.frame(something = runif(10),
                 col1 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col2 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col3 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col4 = sample(c("YES", "NO"), 10, replace = TRUE),
                 col5 = sample(c("YES", "NO"), 10, replace = TRUE))