How to make a function that creates a column with combined observations-CodePudding

I am obviously new to data cleaning and I am having trouble cleaning a survey export. This is how my data frame looks in raw form.

Var1          Colname1  Colname2  Colname3  Var2
Observation1  NA        NA        Val1      Val_1
Observation2  NA        Val2      NA        Val_1
Observation3  Val3      NA        NA        Val_1
Observation4  Val4      Val5      NA        Val_2
Observation5  NA        NA        Val6      Val_2

I would like to have my data cleaned to look like this:

Var1         SubVar1 Var2
Observation1 Val1    Val_1
Observation2 Val2    Val_1
Observation3 Val3    Val_1
Observation4 Val4    Val_2
Observation4 Val5    Val_2
Observation5 Val6    Val_2

I have tried to remove NA values:

df1 <- na.omit(c(Colname1, Colname2, Colname3))

The problem is that it will delete all rows because there is an NA in every row. I have also tried to concatenate the values and then use the separate_rows() function, but that will only work with observations that only have one value in one column. For observations that contain values in multiple columns (see Observation4), this will not work.

Thanks for any help you guys can provide!

CodePudding user response：

Try,

data %>% mutate(SubVar1 = coalesce(Colname1,Colname2,Colname3)) %>%
         select(Var1, SubVar1, Var2)

CodePudding user response：

I would think of this as a pivot (reshaping) operation from wide to long:

library(dplyr)
library(tidyr)

data %>%
  pivot_longer(cols = Colname1:Colname3, values_to = "SubVar1") %>%
  filter(!is.na(SubVar1)) %>%
  select(Var1, SubVar1, Var2)
# # A tibble: 6 × 3
#   Var1         SubVar1 Var2 
#   <chr>        <chr>   <chr>
# 1 Observation1 Val1    Val_1
# 2 Observation2 Val2    Val_1
# 3 Observation3 Val3    Val_1
# 4 Observation4 Val4    Val_2
# 5 Observation4 Val5    Val_2
# 6 Observation5 Val6    Val_2

To understand what's happening, run the first line, then the first and second line, then the first, second and third line, etc. See ?pivot_longer for several other options in specifying which columns to pivot - you could name the explicitly, use a name pattern like names_pattern = "Colname" or use the Colname1:Colname3 to select consecutive columns as I did above.

CodePudding user response：

We can use base R in a vectorized way with row/column indexing. Subset the columns where the column names are 'Colname', then get the column index of non-NA element for each row with max.col, cbind the row sequence, extract the corresponding element and create the new data.frame

i1 <- startsWith(names(df1), "Colname")
data.frame(df1['Var1'], SubVar1 = df1[i1][cbind(seq_len(nrow(df1)), 
      max.col(!is.na(df1[i1]), "first"))], df1['Var2'])
          Var1 SubVar1  Var2
1 Observation1    Val1 Val_1
2 Observation2    Val2 Val_1
3 Observation3    Val3 Val_1
4 Observation4    Val4 Val_2
5 Observation5    Val6 Val_2

data

df1 <- structure(list(Var1 = c("Observation1", "Observation2", "Observation3", 
"Observation4", "Observation5"), Colname1 = c(NA, NA, "Val3", 
"Val4", NA), Colname2 = c(NA, "Val2", NA, "Val5", NA), Colname3 = c("Val1", 
NA, NA, NA, "Val6"), Var2 = c("Val_1", "Val_1", "Val_1", "Val_2", 
"Val_2")), class = "data.frame", row.names = c(NA, -5L))