Home > other >  How to join(merge) multiple data frames and keep column names in R
How to join(merge) multiple data frames and keep column names in R

Time:08-03

I have four data frames:

df01 <- data.frame(ID = c("001","002","003","004"),
                      Name = c("Ben","Jennifer","Mark","Brad"),
                      LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
                      Age = c(37,41))
df03 <- data.frame(ID = c("003"),
                      Age = c(28))
df04 <- data.frame(ID = c("004"),
                      Age = c(48))

I am trying to join using dplyr package with the function left_join like this:

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID")

And my current outcome is

> df
   ID     Name LastName Age.x Age.y Age
1 001      Ben  Affleck    37    NA  NA
2 002 Jennifer    Lopez    41    NA  NA
3 003     Mark  Anthony    NA    28  NA
4 004     Brad     Pitt    NA    NA  48

But my expected outcome would be:

> df
   ID     Name LastName Age
1 001      Ben  Affleck  37
2 002 Jennifer    Lopez  41
3 003     Mark  Anthony  28
4 004     Brad     Pitt  48

I would like to say, this is a very simplified issue because one solution would be binding rows and next applying left_join like this

dfx <- bind_rows(df02,df03,df04)

df <- df01 %>% 
  left_join(dfx, by = "ID")

but the real issue includes larger-than-memory and applying that solution would do an error called "Error: cannot allocate vector of size ..."

Thank you very much for your help.

CodePudding user response:

Here's a use of Reduce (or you can use purrr::reduce, effectively the same thing):

fun <- function(a, b) {
  out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
  if (all(c("Age", "Age.y") %in% names(out))) {
    out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
      select(-Age.y)
  }
  out
}
Reduce(fun, list(df01, df02, df03, df04))
#    ID     Name LastName Age
# 1 001      Ben  Affleck  37
# 2 002 Jennifer    Lopez  41
# 3 003     Mark  Anthony  28
# 4 004     Brad     Pitt  48

Quick walk-through:

  • Reduce calls the function (fun here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhausted
  • coalesce is a great function that returns the first non-NA value of the values provided; and it's vectorized, which is great; try coalesce(c(1,NA,3), c(22,33,44)) and get c(1,33,3).

CodePudding user response:

As you have only one value for each new Age column. You could sum all of them after the left joins.

df <- df01 %>% 
  left_join(df02, by = "ID") %>% 
  left_join(df03, by = "ID") %>% 
  left_join(df04, by = "ID") %>% 
  mutate(Age = sum(c_across(stringr::str_subset(colnames(.), "Age")), na.rm = TRUE)) %>% 
  select(-stringr::str_subset(colnames(.), "Age."))

With stringr package you can select all columns that have "Age" in their name. Then you could do the same to remove all columns with "Age."

  • Related