Home > other >  merge multiple datasets with different numbers of rows R
merge multiple datasets with different numbers of rows R

Time:09-01

I have 4 datasets:

df1 <- data.frame(value_1 = c(1,2,3,4), row.names = c("A", "B", "C", "D"))
df2 <- data.frame(value_2 = c(1,2,3,4,5), row.names = c("A", "B", "C", "D", "E"))
df3 <- data.frame(value_3 = c(1,2,3), row.names = c("A", "D", "E"))
df4 <- data.frame(value_4 = c(5, 6, 7, 8), row.names = c("A", "C", "D", "E"))

I need to have this output:

enter image description here

I found some similar questions, but they do not work in my case.

this:

do.call("merge", c(lapply(list(df1, df2, df3, df4), data.frame, row.names=NULL), 
               by = 0, all = TRUE))

gives an error

Error in fix.by(by.x, x) : 'by' must specify one or more columns as numbers, names or logical

this

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3, df4))

duplicates values

CodePudding user response:

We may create a row names column in the data and use that for joining

library(dplyr)
library(tibble)
library(purrr)
library(tidyr)
list(df1, df2, df3, df4) %>%
  map(~ .x %>% rownames_to_column('rn')) %>%
  reduce(full_join, by = "rn") %>%
  mutate(across(-rn, replace_na, 0)) %>% 
  column_to_rownames('rn')

-output

  value_1 value_2 value_3 value_4
A       1       1       1       5
B       2       2       0       0
C       3       3       0       6
D       4       4       2       7
E       0       5       3       8

The by = 0 or by = "row.names" works for the first join, but after the first merge, row.name will be a column

> merge(df1, df2, by = "row.names", all = TRUE)
  Row.names value_1 value_2
1         A       1       1
2         B       2       2
3         C       3       3
4         D       4       4
5         E      NA       5

and thus it wouldn't work. We could create a column and then do the merge

Reduce(\(x, y) merge(x, y, by = 'rn', all = TRUE), 
   lapply(list(df1, df2, df3, df4), \(x) transform(x,
     rn = row.names(x))))
 rn value_1 value_2 value_3 value_4
1  A       1       1       1       5
2  B       2       2      NA      NA
3  C       3       3      NA       6
4  D       4       4       2       7
5  E      NA       5       3       8

Or in a base R |>

list(df1, df2, df3, df4) |> 
  lapply(\(x) transform(x, rn = row.names(x))) |> 
  Reduce(\(x, y) merge(x, y, all = TRUE), x = _)
  rn value_1 value_2 value_3 value_4
1  A       1       1       1       5
2  B       2       2      NA      NA
3  C       3       3      NA       6
4  D       4       4       2       7
5  E      NA       5       3       8

Or another option would be to do a join first between the first two datasets, keep it in a list, and then use by.x and by.y

list(merge(df1, df2, by = "row.names", all = TRUE), df3, df4) |> 
   Reduce(\(x, y) merge(x, y, by.x = "Row.names",
         by.y = "row.names", all = TRUE), x = _)
  Row.names value_1 value_2 value_3 value_4
1         A       1       1       1       5
2         B       2       2      NA      NA
3         C       3       3      NA       6
4         D       4       4       2       7
5         E      NA       5       3       8

If we don't want to do the first two dataset join separately, then create a function to dynamically check if the "Row.name" column exist or not, and make changes to by.x and by.y accordingly

f1 <- function(x, y)
   {
   i1 <- any(grepl("Row.names", names(x)))
    i2 <- any(grepl("Row.names", names(y)))
     nm1 <- if(i1) "Row.names"else "row.names"
      nm2 <- if(i2) "Row.names" else "row.names"
      merge(x, y, by.x = nm1, by.y = nm2 , all = TRUE)
   
   }
   
  list(df1, df2, df3, df4) |> 
      Reduce(f1, x= _)
   Row.names value_1 value_2 value_3 value_4
 1         A       1       1       1       5
 2         B       2       2      NA      NA
 3         C       3       3      NA       6
 4         D       4       4       2       7
 5         E      NA       5       3       8

CodePudding user response:

A data.table approach:

library(data.table)
Reduce(function(x, y) merge(x, y, all = TRUE), 
       lapply(list(df1, df2, df3, df4), function(y) data.table(y, keep.rownames=TRUE, key = "rn")))

   rn value_1 value_2 value_3 value_4
1:  A       1       1       1       5
2:  B       2       2      NA      NA
3:  C       3       3      NA       6
4:  D       4       4       2       7
5:  E      NA       5       3       8
  •  Tags:  
  • r
  • Related