Home > Mobile >  Replacing NAs with existing data when merging two dataframes in R
Replacing NAs with existing data when merging two dataframes in R

Time:04-28

I would like to merge two dataframes. There are some shared variables and some different variables and there are different numbers of rows in each dataframe. The dataframes share some rows, but not all. And both dataframes have missing data that the other my have.

DF1:

name age weight height
Tim 7 54 112
Dave 5 50 NA
Larry NA 42 73
Rob 1 30 43

DF2:

name age weight height grade
Tim 7 NA 112 2
Dave NA 50 103 1
Larry 3 NA 73 NA
Rob 1 30 NA NA
John 6 60 NA 1
Tom 8 61 112 2

I want to merge these two dataframes together by the shared columns (name, age, weight, and height). However, I want NAs to be overridden, such that if one of the two dataframes has a value where the other has NA, I want the value to be carried through into the third dataframe. Ideally, the last dataframe should only have NAs when both DF1 and DF2 had NAs in that same location.

Ideal Data Frame

name age weight height grade
Tim 7 54 112 2
Dave 5 50 103 1
Larry 3 42 73 NA
Rob 1 30 43 NA
John 6 60 NA 1
Tom 8 61 112 2

I've been using full_join and left_join, but I don't know how to merge these in such a way that NAs are replaced with actual data (if it is present in one of the dataframes). Is there a way to do this?

CodePudding user response:

This is a typical case that rows_patch() from dplyr can treat.

library(dplyr)

rows_patch(df2, df1, by = "name")

   name age weight height grade
1   Tim   7     54    112     2
2  Dave   5     50    103     1
3 Larry   3     42     73    NA
4   Rob   1     30     43    NA
5  John   6     60     NA     1
6   Tom   8     61    112     2

Data
df1 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob"), age = c(7L, 
5L, NA, 1L), weight = c(54L, 50L, 42L, 30L), height = c(112L, 
NA, 73L, 43L)), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(name = c("Tim", "Dave", "Larry", "Rob", "John", 
"Tom"), age = c(7L, NA, 3L, 1L, 6L, 8L), weight = c(NA, 50L, 
NA, 30L, 60L, 61L), height = c(112L, 103L, 73L, NA, NA, 112L), 
grade = c(2L, 1L, NA, NA, 1L, 2L)), class = "data.frame", row.names = c(NA, -6L))

CodePudding user response:

I like the powerjoin package suggested as an answer to the question in the first comment, which I had never heard of before.

However, if you want to avoid using extra packages, you can do it in base R. This approach also avoids having to explicitly name each column - the dplyr approaches suggested in the comments do not do that, although perhaps could be modified.

# Load data

df1  <- read.table(text = "name age weight  height
Tim 7   54  112
Dave    5   50  NA
Larry   NA  42  73
Rob 1   30  43", header=TRUE)
df2  <- read.table(text = "name age weight  height  grade
Tim 7   NA  112 2
Dave    NA  50  103 1
Larry   3   NA  73  NA
Rob 1   30  NA  NA
John    6   60  NA  1
Tom 8   61  112 2", header=TRUE)


df3  <- merge(df1, df2, by = "name", all = TRUE, sort=FALSE)

# Coalesce the common columns
common_cols  <- names(df1)[names(df1)!="name"]
df3[common_cols]  <- lapply(common_cols, function(col) {
    coalesce(df3[[paste0(col, ".x")]], df3[[paste0(col, ".y")]])
}) 

# Select desired columns
df3[names(df2)]

#    name age weight height grade
# 1   Tim   7     54    112     2
# 2  Dave   5     50    103     1
# 3 Larry   3     42     73    NA
# 4   Rob   1     30     43    NA
# 5  John   6     60     NA     1
# 6   Tom   8     61    112     2

There are advantages to using base R, but powerjoin looks like an interesting package too.

CodePudding user response:

Another possible solution:

library(tidyverse)

df2 %>% 
  bind_rows(df1) %>% 
  group_by(name) %>% 
  fill(age:grade, .direction = "updown") %>% 
  ungroup %>% 
  distinct

#> # A tibble: 6 x 5
#>   name    age weight height grade
#>   <chr> <int>  <int>  <int> <int>
#> 1 Tim       7     54    112     2
#> 2 Dave      5     50    103     1
#> 3 Larry     3     42     73    NA
#> 4 Rob       1     30     43    NA
#> 5 John      6     60     NA     1
#> 6 Tom       8     61    112     2
  • Related