Home > Software engineering >  Combining multiple columns/variables into a single column
Combining multiple columns/variables into a single column

Time:02-10

I have the following problem (I guess there is an easy answer to it, but I can't figure it out...).

I want to combine multiple columns into a single column. I have 3 variables and there are cases that answered variable 1, cases that answered variable 2, cases that answered variable 3 and cases that answered none of the variables.

Now, I want to combine them all in a single variable, that looks like column vx:

Ideal result: 

  v1 v2 v3 vx
1   1 NA NA  1
2   3 NA NA  3
3   6 NA NA  6
4  NA  5 NA  5
5  NA  1 NA  1
6  NA  3 NA  3
7  NA NA  4  4
8  NA NA  2  2
9  NA NA  1  1
10 NA NA NA NA

v1 <- c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA)
v2 <- c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA)
v3 <- c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)

df <- data.frame(v1, v2, v3)

I tried it with df$vx <- paste(df$v1, df$v2, df$v3) but then I get the following result:

My result: 

 v1 v2 v3       vx
1   1 NA NA  1 NA NA
2   3 NA NA  3 NA NA
3   6 NA NA  6 NA NA
4  NA  5 NA  NA 5 NA
5  NA  1 NA  NA 1 NA
6  NA  3 NA  NA 3 NA
7  NA NA  4  NA NA 4
8  NA NA  2  NA NA 2
9  NA NA  1  NA NA 1
10 NA NA NA NA NA NA

Can someone tell me how I get a result like the one above (ideal result) without the NAs (except if there are only NAs then I would like to have only one NA in column vx)

I hope I made clear what my issue is.

Thanks a lot!

CodePudding user response:

That is what dplyr::coalesce was made for:

library(dplyr)
df$v4 <- coalesce(!!!df)

#Also works:
df %>% 
  mutate(v4 = coalesce(v1, v2, v3))

output

   v1 v2 v3 v4
1   1 NA NA  1
2   3 NA NA  3
3   6 NA NA  6
4  NA  5 NA  5
5  NA  1 NA  1
6  NA  3 NA  3
7  NA NA  4  4
8  NA NA  2  2
9  NA NA  1  1
10 NA NA NA NA

CodePudding user response:

Using apply()

# Your data.
v1 = c(1, 3, 6, NA, NA, NA, NA, NA, NA, NA)
v2 = c(NA, NA, NA, 5, 1, 3, NA, NA, NA, NA)
v3 = c(NA, NA, NA, NA, NA, NA, 4, 2, 1, NA)

df = data.frame(v1, v2, v3)
df

# Solution: writing a function to be passed in apply().
useful.function = function(x)
{
  # The input "x" is a row of a dataframe.
  
  # If all the values are NA, return NA.
  if(sum(!is.na(x)) == 0)
    return(NA)
  
  # Otherwise, return the non-NA value.
  return(x[!is.na(x)])
}

df$vx = apply(df, MARGIN = 1, useful.function)
df

Clearly, other solutions may be faster and require less coding (as those relying on the dplyr package, posted by @Maël). However, I really suggest you to get confident in using apply() and the other functions from the same family (see lapply() and sapply()), as they are really flexible (and sometimes you may not be aware of the existence of a certain function or package).

CodePudding user response:

Using max.col in base R -

df$vx <- df[cbind(1:nrow(df), max.col(!is.na(df)))]
df

#   v1 v2 v3 vx
#1   1 NA NA  1
#2   3 NA NA  3
#3   6 NA NA  6
#4  NA  5 NA  5
#5  NA  1 NA  1
#6  NA  3 NA  3
#7  NA NA  4  4
#8  NA NA  2  2
#9  NA NA  1  1
#10 NA NA NA NA

max.col returns the index of max value in each row. With !is.na(df) we'll get the index of TRUE value in each row because TRUE > FALSE. We create a matrix with cbind to extract the max value from each row.

  • Related