I am obviously new to data cleaning and I am having trouble cleaning a survey export. This is how my data frame looks in raw form.
Var1 Colname1 Colname2 Colname3 Var2
Observation1 NA NA Val1 Val_1
Observation2 NA Val2 NA Val_1
Observation3 Val3 NA NA Val_1
Observation4 Val4 Val5 NA Val_2
Observation5 NA NA Val6 Val_2
I would like to have my data cleaned to look like this:
Var1 SubVar1 Var2
Observation1 Val1 Val_1
Observation2 Val2 Val_1
Observation3 Val3 Val_1
Observation4 Val4 Val_2
Observation4 Val5 Val_2
Observation5 Val6 Val_2
I have tried to remove NA values:
df1 <- na.omit(c(Colname1, Colname2, Colname3))
The problem is that it will delete all rows because there is an NA in every row. I have also tried to concatenate the values and then use the separate_rows() function, but that will only work with observations that only have one value in one column. For observations that contain values in multiple columns (see Observation4), this will not work.
Thanks for any help you guys can provide!
CodePudding user response:
Try,
data %>% mutate(SubVar1 = coalesce(Colname1,Colname2,Colname3)) %>%
select(Var1, SubVar1, Var2)
CodePudding user response:
I would think of this as a pivot (reshaping) operation from wide to long:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = Colname1:Colname3, values_to = "SubVar1") %>%
filter(!is.na(SubVar1)) %>%
select(Var1, SubVar1, Var2)
# # A tibble: 6 × 3
# Var1 SubVar1 Var2
# <chr> <chr> <chr>
# 1 Observation1 Val1 Val_1
# 2 Observation2 Val2 Val_1
# 3 Observation3 Val3 Val_1
# 4 Observation4 Val4 Val_2
# 5 Observation4 Val5 Val_2
# 6 Observation5 Val6 Val_2
To understand what's happening, run the first line, then the first and second line, then the first, second and third line, etc. See ?pivot_longer
for several other options in specifying which columns to pivot - you could name the explicitly, use a name pattern like names_pattern = "Colname"
or use the Colname1:Colname3
to select consecutive columns as I did above.
CodePudding user response:
We can use base R
in a vectorized way with row/column indexing. Subset the columns where the column names are 'Colname', then get the column index of non-NA element for each row with max.col
, cbind
the row sequence, extract the corresponding element and create the new data.frame
i1 <- startsWith(names(df1), "Colname")
data.frame(df1['Var1'], SubVar1 = df1[i1][cbind(seq_len(nrow(df1)),
max.col(!is.na(df1[i1]), "first"))], df1['Var2'])
Var1 SubVar1 Var2
1 Observation1 Val1 Val_1
2 Observation2 Val2 Val_1
3 Observation3 Val3 Val_1
4 Observation4 Val4 Val_2
5 Observation5 Val6 Val_2
data
df1 <- structure(list(Var1 = c("Observation1", "Observation2", "Observation3",
"Observation4", "Observation5"), Colname1 = c(NA, NA, "Val3",
"Val4", NA), Colname2 = c(NA, "Val2", NA, "Val5", NA), Colname3 = c("Val1",
NA, NA, NA, "Val6"), Var2 = c("Val_1", "Val_1", "Val_1", "Val_2",
"Val_2")), class = "data.frame", row.names = c(NA, -5L))