I have the following problem (I guess there is an easy answer to it, but I can't figure it out...).
I want to combine multiple columns into a single column. I have 3 variables and there are cases that answered variable 1, cases that answered variable 2, cases that answered variable 3 and cases that answered none of the variables.
Now, I want to combine them all in a single variable, that looks like column vx
:
Ideal result:
v1 v2 v3 vx
1 1 NA NA 1
2 3 NA NA 3
3 6 NA NA 6
4 NA 5 NA 5
5 NA 1 NA 1
6 NA 3 NA 3
7 NA NA 4 4
8 NA NA 2 2
9 NA NA 1 1
10 NA NA NA NA
v1 <- c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA)
v2 <- c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA)
v3 <- c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)
df <- data.frame(v1, v2, v3)
I tried it with df$vx <- paste(df$v1, df$v2, df$v3)
but then I get the following result:
My result:
v1 v2 v3 vx
1 1 NA NA 1 NA NA
2 3 NA NA 3 NA NA
3 6 NA NA 6 NA NA
4 NA 5 NA NA 5 NA
5 NA 1 NA NA 1 NA
6 NA 3 NA NA 3 NA
7 NA NA 4 NA NA 4
8 NA NA 2 NA NA 2
9 NA NA 1 NA NA 1
10 NA NA NA NA NA NA
Can someone tell me how I get a result like the one above (ideal result) without the NAs
(except if there are only NAs
then I would like to have only one NA
in column vx
)
I hope I made clear what my issue is.
Thanks a lot!
CodePudding user response:
That is what dplyr::coalesce
was made for:
library(dplyr)
df$v4 <- coalesce(!!!df)
#Also works:
df %>%
mutate(v4 = coalesce(v1, v2, v3))
output
v1 v2 v3 v4
1 1 NA NA 1
2 3 NA NA 3
3 6 NA NA 6
4 NA 5 NA 5
5 NA 1 NA 1
6 NA 3 NA 3
7 NA NA 4 4
8 NA NA 2 2
9 NA NA 1 1
10 NA NA NA NA
CodePudding user response:
Using apply()
# Your data.
v1 = c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA)
v2 = c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA)
v3 = c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)
df = data.frame(v1, v2, v3)
df
# Solution: writing a function to be passed in apply().
useful.function = function(x)
{
# The input "x" is a row of a dataframe.
# If all the values are NA, return NA.
if(sum(!is.na(x)) == 0)
return(NA)
# Otherwise, return the non-NA value.
return(x[!is.na(x)])
}
df$vx = apply(df, MARGIN = 1, useful.function)
df
Clearly, other solutions may be faster and require less coding (as those relying on the dplyr
package, posted by @Maël). However, I really suggest you to get confident in using apply()
and the other functions from the same family (see lapply()
and sapply()
), as they are really flexible (and sometimes you may not be aware of the existence of a certain function or package).
CodePudding user response:
Using max.col
in base R -
df$vx <- df[cbind(1:nrow(df), max.col(!is.na(df)))]
df
# v1 v2 v3 vx
#1 1 NA NA 1
#2 3 NA NA 3
#3 6 NA NA 6
#4 NA 5 NA 5
#5 NA 1 NA 1
#6 NA 3 NA 3
#7 NA NA 4 4
#8 NA NA 2 2
#9 NA NA 1 1
#10 NA NA NA NA
max.col
returns the index of max value in each row. With !is.na(df)
we'll get the index of TRUE
value in each row because TRUE > FALSE
. We create a matrix with cbind
to extract the max value from each row.