I have four data frames:
df01 <- data.frame(ID = c("001","002","003","004"),
Name = c("Ben","Jennifer","Mark","Brad"),
LastName = c("Affleck","Lopez","Anthony","Pitt"))
df02 <- data.frame(ID = c("001","002"),
Age = c(37,41))
df03 <- data.frame(ID = c("003"),
Age = c(28))
df04 <- data.frame(ID = c("004"),
Age = c(48))
I am trying to join using dplyr package with the function left_join like this:
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID")
And my current outcome is
> df
ID Name LastName Age.x Age.y Age
1 001 Ben Affleck 37 NA NA
2 002 Jennifer Lopez 41 NA NA
3 003 Mark Anthony NA 28 NA
4 004 Brad Pitt NA NA 48
But my expected outcome would be:
> df
ID Name LastName Age
1 001 Ben Affleck 37
2 002 Jennifer Lopez 41
3 003 Mark Anthony 28
4 004 Brad Pitt 48
I would like to say, this is a very simplified issue because one solution would be binding rows and next applying left_join like this
dfx <- bind_rows(df02,df03,df04)
df <- df01 %>%
left_join(dfx, by = "ID")
but the real issue includes larger-than-memory and applying that solution would do an error called "Error: cannot allocate vector of size ..."
Thank you very much for your help.
CodePudding user response:
Here's a use of Reduce
(or you can use purrr::reduce
, effectively the same thing):
fun <- function(a, b) {
out <- left_join(a, b, by = "ID", suffix = c("", ".y"))
if (all(c("Age", "Age.y") %in% names(out))) {
out <- mutate(out, Age = coalesce(Age.y, Age)) %>%
select(-Age.y)
}
out
}
Reduce(fun, list(df01, df02, df03, df04))
# ID Name LastName Age
# 1 001 Ben Affleck 37
# 2 002 Jennifer Lopez 41
# 3 003 Mark Anthony 28
# 4 004 Brad Pitt 48
Quick walk-through:
Reduce
calls the function (fun
here) on the first two elements of the list provided; it then calls with that return value and the 3rd element in the list; then calls with that return value and the 4th; until the list is exhaustedcoalesce
is a great function that returns the first non-NA
value of the values provided; and it's vectorized, which is great; trycoalesce(c(1,NA,3), c(22,33,44))
and getc(1,33,3)
.
CodePudding user response:
As you have only one value for each new Age
column. You could sum all of them after the left joins.
df <- df01 %>%
left_join(df02, by = "ID") %>%
left_join(df03, by = "ID") %>%
left_join(df04, by = "ID") %>%
mutate(Age = sum(c_across(stringr::str_subset(colnames(.), "Age")), na.rm = TRUE)) %>%
select(-stringr::str_subset(colnames(.), "Age."))
With stringr
package you can select all columns that have "Age" in their name. Then you could do the same to remove all columns with "Age."