Home > database >  Merging different data frames in R to eliminate NAs
Merging different data frames in R to eliminate NAs

Time:03-19

I'm currently working on a longitudinal data base in R. Therefore, I have a lot of missing values, because the values of the variables which have been unchanged since the last interview are not added in the new database. For example in the first wave the sex is defined as boy or girl and it doesn't change between the first wave and the second wave, so they are not giving the sex in the second wave again.

Basically, what I would like to do is to merge the data I have selected for the second wave and merge it with the data from the first wave, in order to eliminate some NAs. However, I would like to only keep the columns I have selected from the second wave. For the moment, and after looking on the internet, I was only able to merge the two datasets but I'm not able to only keep the data from the second wave.

Here is my code:

library("rqdatatable")

x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)


y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)

final <- natural_join(x, y, 
                   by = "ID",
                   jointype = "LEFT")

What I would like to get after my merge is:

    z = data.frame(
  ID = c(1,2,3,4),
  S1 = c(1, 3, 5,0),
  S2 = c(2, 0, 2,2)
)

Do you have any idea of how I can solve my problem? It would be very time consuming to merge everything and to select the variables I want again.

Many thanks and best regards!

CodePudding user response:

We could use inner_join in combination with coalesce

library(dplyr)

x %>% 
  inner_join(y, by="ID") %>% 
  mutate(S1 = coalesce(S1.x, S1.y),
         S2 = coalesce(S2.x, S2.y)) %>% 
  select(ID, S1, S2)
  
  ID S1 S2
1  1  1  2
2  2  3  0
3  3  5  2
4  4  0  2

CodePudding user response:

Here is a base r function that joins the data like in the question. It can also be call via a pipe, in this case R's pipe operator introduced in R 4.1.

x <- data.frame(
  ID = c(1,2,3,4),
  S1 = c(1, 3, NA,0),
  S2 = c(2, NA, 2,2)
)

y <- data.frame(
  ID = c(1, 2, 3, 4,5,6,7,8),
  S1 = c(1, 2, 5, 1,3,6,8,2),
  S3 = c(3, 3, 3, 3,7,1,6,9),
  S2 = c(0,0,0,0,0,0,0,0),
  S4 = c(0,0,0,0,0,0,0,0)
)


joinSpecial <- function(x, y, idcol = "ID"){
  idcolx <- which(names(x) == idcol)
  idcoly <- which(names(y) == idcol)
  idx <- which(names(x) %in% names(y))
  idy <- which(names(y) %in% names(x))
  idx <- idx[idx != idcolx]
  idy <- idy[idy != idcoly]
  i <- match(x[[idcolx]], y[[idcoly]])
  x[idx] <- mapply(\(a, b, i){
    na <- is.na(a)
    a[na] <- b[i][na]
    a
  }, x[idx], y[idy], MoreArgs = list(i = i), SIMPLIFY = FALSE)
  x
}

joinSpecial(x, y)
#>   ID S1 S2
#> 1  1  1  2
#> 2  2  3  0
#> 3  3  5  2
#> 4  4  0  2

x |> joinSpecial(y)
#>   ID S1 S2
#> 1  1  1  2
#> 2  2  3  0
#> 3  3  5  2
#> 4  4  0  2

Created on 2022-03-18 by the reprex package (v2.0.1)

  • Related