Joining two incomplete data.tables with the same column names-CodePudding

I have two incomplete data.tables with the same column names.

dt1 <- data.table(id = c(1, 2, 3), v1 = c("w", "x", NA), v2 = c("a", NA, "c"))
dt2 <- data.table(id = c(2, 3, 4), v1 = c(NA, "y", "z"), v2 = c("b", "c", NA))

They look like this:

dt1
   id   v1   v2
1:  1    w    a
2:  2    x <NA>
3:  3 <NA>    c

> dt2
   id   v1   v2
1:  2 <NA>    b
2:  3    y    c
3:  4    z <NA>

Is there a way to merge the two by filling in the missing info?

This is the result I'm after:

   id v1   v2
1:  1  w    a
2:  2  x    b
3:  3  y    c
4:  4  z <NA>

I've tried various data.table joins, merges but I either get the columns repeated:

> merge(dt1,
        dt2,
        by = "id",
        all = TRUE)

   id v1.x v2.x v1.y v2.y
1:  1    w    a <NA> <NA>
2:  2    x <NA> <NA>    b
3:  3 <NA>    c    y    c
4:  4 <NA> <NA>    z <NA>

or the rows repeated:

> merge(dt1,
        dt2,
        by = names(dt1),
        all = TRUE)
   id   v1   v2
1:  1    w    a
2:  2 <NA>    b
3:  2    x <NA>
4:  3 <NA>    c
5:  3    y    c
6:  4    z <NA>

Both data.tables have the same column names.

CodePudding user response：

You could also start out with rbind():

rbind(dt1, dt2)[, lapply(.SD, \(x) unique(x[!is.na(x)])), by = id]

#       id     v1     v2
#    <num> <char> <char>
# 1:     1      w      a
# 2:     2      x      b
# 3:     3      y      c
# 4:     4      z   <NA>

CodePudding user response：

You can group by ID and get the unique values after omitting NAs, i.e.

library(data.table)

merge(dt1, dt2, all = TRUE)[, 
        lapply(.SD, function(i)na.omit(unique(i))), 
                            by = id][]

#   id v1   v2
#1:  1  w    a
#2:  2  x    b
#3:  3  y    c
#4:  4  z <NA>

CodePudding user response：

First full_join and after that group_by per id and merge the rows:

library(dplyr)
library(tidyr)
dt1 %>%
  full_join(dt2, by = c("id", "v1", "v2")) %>%
  group_by(id) %>%
  fill(starts_with('v'), .direction = 'updown') %>%
  slice(1) %>%
  ungroup

Output:

# A tibble: 4 × 3
     id v1    v2   
  <dbl> <chr> <chr>
1     1 w     a    
2     2 x     b    
3     3 y     c    
4     4 z     NA