I have a dataframe with missing data for some rows. The correct data can be found in another set of columns. I want to replace the NAs with the correct data.
My data looks like this:
df <- data.frame(M_1=c(1,NA,3,NA,6),
M_2=c(5,NA,3,NA,1),
M_3=c(6,NA,2,NA,4),
M_C_1=c(NA,2,NA,6,NA),
M_C_2=c(NA,1,NA,4,NA),
M_C_3=c(NA,7,NA,3,NA))
df
# M_1 M_2 M_3 M_C_1 M_C_2 M_C_3
#1 1 5 6 NA NA NA
#2 NA NA NA 2 1 7
#3 3 3 2 NA NA NA
#4 NA NA NA 6 4 3
#5 6 1 4 NA NA NA
For all records, I either have a complete set of records for variables M_1, M_2, and M_3 or I have a complete set for variables M_C_1, M_C_2, and M_C_3.
For each row that has NAs in the first set of variables (M_1:M_3), I would like to replace with the values from the second set of values (M_C_1:M_C_2). I dont need to retain the second set of values. So my desired data frame would look like:
df
# M_1 M_2 M_3
#1 1 5 6
#2 2 1 7
#3 3 3 2
#4 6 4 3
#5 6 1 4
My real dataset contains many columns in this notation, so I need a general solutions (ie, I dont want to refer to each column individually).
I would like to do this with dplyr if possible.
CodePudding user response:
You could use map2
coalesce
:
library(dplyr)
library(purrr)
map2_dfc(select(df, 1:3), select(df, 4:6), coalesce)
# # A tibble: 5 × 3
# M_1 M_2 M_3
# <dbl> <dbl> <dbl>
# 1 1 5 6
# 2 2 1 7
# 3 3 3 2
# 4 6 4 3
# 5 6 1 4
CodePudding user response:
Here's a generic example that works if the columns have names that allow them to be identified and they are in the correct order.
library(dplyr)
df <- data.frame(M_1=c(1,NA,3,NA,6),
M_2=c(5,NA,3,NA,1),
M_3=c(6,NA,2,NA,4),
M_C_1=c(NA,2,NA,6,NA),
M_C_2=c(NA,1,NA,4,NA),
M_C_3=c(NA,7,NA,3,NA))
# make a temporary id so we can retain the order later
df <- df %>% mutate(temporary_id = 1:n())
# find the columns corresponding to the final data
# they are assumed to be of the form M_number
df_records <-
df %>%
select(matches('temporary_id|M_[0-9] ')) %>%
na.omit()
# find the extra columns with data to replace in the final data
# they are assumed to be of the form M_C_number
df_extra <-
df %>%
select(matches('temporary_id|M_C_[0-9] ')) %>%
na.omit()
# change the names of the extra columns to match the final data
# this only works if the columns are in the correct order in the original data frame
names(df_extra) <- names(df_records)
# bind the rows of the final and extra data, sort and remove the temporary id
final_df <-
df_records %>%
bind_rows(df_extra) %>%
arrange(temporary_id) %>%
select(-temporary_id)
final_df
# M_1 M_2 M_3
#1 1 5 6
#2 2 1 7
#3 3 3 2
#4 6 4 3
#5 6 1 4
If they are not in the required order, some sorting could be done but I'll leave that for now.