I am trying to join a list of tibbles from various datasources. Basically I have a list of journals for which I am trying to add some information from other sources.
One of the purposes to do this is to fill out missing data in some columns which are also being used for joining. For the sake of example, I have the following two datasets that resemble the structure of my data.
df1 <- tibble(journal_title = c(NA,
"Journal of yyy",
"Journal of zzz"),
issn = c(9999, 1234, NA))
df2 <- tibble(journal_title = c("Journal of xxx", NA, "Journal of zzz"),
issn = c(9999, 1234, 8888),
rank = c(1,2,3))
> df1
# A tibble: 3 × 2
journal_title issn
<chr> <dbl>
1 NA 9999
2 Journal of yyy 1234
3 Journal of zzz NA
> df2
# A tibble: 3 × 3
journal_title issn rank
<chr> <dbl> <dbl>
1 Journal of xxx 9999 1
2 NA 1234 2
3 Journal of zzz 8888 3
I wish to join the two datasets and basically carry out a left join where x = df1 and y = df2, i.e. df1 is the main data to which i want to add columns from df2.
However, as shown in the data, there are two ID columns which should be used for the join function. The issue is that there might be NA in one of the columns. Therefore, using by = c("issn", "journal_title")
doesnt work.
Therefore i want to:
Join the datasets by both columns (
issn
andjournal_title
), seeing as there might be NA in one of them. I want to keep issn as the "first try", and then if there is not a match in df2,journal_title
should be used.Fill out the NAs with values from the two datasets.
I have tried making a "synthetic" ID-column by coalescing the two columns with x = issn
and y = journal_title
. However, this doesn't work as it doesn't factor in that in some cases, for example, both issn
and journal_title
is present the first dataset, whereas for the corresponding issn
, only journal_title
is present in the second dataset.
My goal data looks like this:
df3 <- tibble(journal_title = c("Journal of xxx", "Journal of yyy", "Journal of zzz"),
issn = c(9999, 1234, 8888),
rank = c(1,2,3))
> df3
# A tibble: 3 × 3
journal_title issn rank
<chr> <dbl> <dbl>
1 Journal of xxx 9999 1
2 Journal of yyy 1234 2
3 Journal of zzz 8888 3
I hope I have made myself clear, and any help is appreciated!
CodePudding user response:
Maybe you can start from here. I do not know if it covers all your cases, but it covers the cases you've published:
library(dplyr)
df2 %>%
left_join(df1, by ='issn') %>%
mutate(journal_title = ifelse(is.na(journal_title.x), journal_title.y, journal_title.x)) %>%
select(journal_title, issn, rank)
# A tibble: 3 x 3
journal_title issn rank
<chr> <dbl> <dbl>
1 Journal of xxx 9999 1
2 Journal of yyy 1234 2
3 Journal of zzz 8888 3
This solution covers if you have some NA
in df2$issn
and the issn
is in df1
:
df2 %>%
left_join(df1, by ='issn') %>%
mutate(journal_title = ifelse(is.na(journal_title.x), journal_title.y, journal_title.x)) %>%
left_join(df1, by = 'journal_title') %>%
mutate(issn = ifelse(is.na(issn.x), issn.y, issn.x)) %>%
select(journal_title, issn, rank)
With data:
df1 <- tibble(journal_title = c(NA,
"Journal of yyy",
"Journal of zzz",
"Journal of rrr"),
issn = c(9999, 1234, NA,1001))
df2 <- tibble(journal_title = c("Journal of xxx", NA, "Journal of zzz","Journal of rrr"),
issn = c(9999, 1234, 8888, NA),
rank = c(1,2,3,4))
> df1
# A tibble: 4 x 2
journal_title issn
<chr> <dbl>
1 NA 9999
2 Journal of yyy 1234
3 Journal of zzz NA
4 Journal of rrr 1001
> df2
# A tibble: 4 x 3
journal_title issn rank
<chr> <dbl> <dbl>
1 Journal of xxx 9999 1
2 NA 1234 2
3 Journal of zzz 8888 3
4 Journal of rrr NA 4
CodePudding user response:
It seems you cannot combine multiple join conditions with OR
to use journal_title as backup join key. Instead of two left_join
s with intermediate coalesce
as workaround you could do a full_join
and reduce your result set afterwards by grouping over issn.
df1 |>
dplyr::full_join(df2, by=c("issn", "journal_title")) |>
dplyr::group_by(issn) |>
dplyr::summarise(
rank = na.omit(rank),
journal_title = na.omit(journal_title)
)
# A tibble: 3 × 3
# Groups: issn [3]
issn rank journal_title
<dbl> <dbl> <chr>
1 1234 2 Journal of yyy
2 8888 3 Journal of zzz
3 9999 1 Journal of xxx
It works with your minimal example. If you have lots of data this might be a ineffecient way as your intermediate result is a lot bigger than the output.
CodePudding user response:
Another option with bind_rows()
:
bind_rows(df1, df2) |>
group_by(issn) |>
summarise(
journal_title = na.omit(journal_title),
rank = na.omit(rank)
)
# issn journal_title rank
# <dbl> <chr> <dbl>
# 1 1234 Journal of yyy 2
# 2 8888 Journal of zzz 3
# 3 9999 Journal of xxx 1