Home > Enterprise >  Joining datasets and coalescing ID columns in R
Joining datasets and coalescing ID columns in R

Time:08-16

I am trying to join a list of tibbles from various datasources. Basically I have a list of journals for which I am trying to add some information from other sources.

One of the purposes to do this is to fill out missing data in some columns which are also being used for joining. For the sake of example, I have the following two datasets that resemble the structure of my data.

df1 <- tibble(journal_title = c(NA, 
                           "Journal of yyy",
                           "Journal of zzz"),
         issn = c(9999, 1234, NA))

df2 <- tibble(journal_title = c("Journal of xxx", NA, "Journal of zzz"),
              issn = c(9999, 1234, 8888),
              rank = c(1,2,3))

> df1
# A tibble: 3 × 2
  journal_title   issn
  <chr>          <dbl>
1 NA              9999
2 Journal of yyy  1234
3 Journal of zzz    NA

> df2
# A tibble: 3 × 3
  journal_title   issn  rank
  <chr>          <dbl> <dbl>
1 Journal of xxx  9999     1
2 NA              1234     2
3 Journal of zzz  8888     3

I wish to join the two datasets and basically carry out a left join where x = df1 and y = df2, i.e. df1 is the main data to which i want to add columns from df2.

However, as shown in the data, there are two ID columns which should be used for the join function. The issue is that there might be NA in one of the columns. Therefore, using by = c("issn", "journal_title") doesnt work.

Therefore i want to:

  1. Join the datasets by both columns (issn and journal_title), seeing as there might be NA in one of them. I want to keep issn as the "first try", and then if there is not a match in df2, journal_title should be used.

  2. Fill out the NAs with values from the two datasets.

I have tried making a "synthetic" ID-column by coalescing the two columns with x = issn and y = journal_title. However, this doesn't work as it doesn't factor in that in some cases, for example, both issn and journal_title is present the first dataset, whereas for the corresponding issn, only journal_title is present in the second dataset.

My goal data looks like this:

df3 <- tibble(journal_title = c("Journal of xxx", "Journal of yyy", "Journal of zzz"),
              issn = c(9999, 1234, 8888),
              rank = c(1,2,3))

>  df3
# A tibble: 3 × 3
  journal_title   issn  rank
  <chr>          <dbl> <dbl>
1 Journal of xxx  9999     1
2 Journal of yyy  1234     2
3 Journal of zzz  8888     3

I hope I have made myself clear, and any help is appreciated!

CodePudding user response:

Maybe you can start from here. I do not know if it covers all your cases, but it covers the cases you've published:

library(dplyr)

df2 %>%
  left_join(df1, by ='issn') %>%
  mutate(journal_title = ifelse(is.na(journal_title.x), journal_title.y, journal_title.x)) %>%
  select(journal_title, issn, rank)

# A tibble: 3 x 3
  journal_title   issn  rank
  <chr>          <dbl> <dbl>
1 Journal of xxx  9999     1
2 Journal of yyy  1234     2
3 Journal of zzz  8888     3

This solution covers if you have some NA in df2$issn and the issn is in df1:

df2 %>%
  left_join(df1, by ='issn') %>%
  mutate(journal_title = ifelse(is.na(journal_title.x), journal_title.y, journal_title.x)) %>%
  left_join(df1, by = 'journal_title') %>%
  mutate(issn = ifelse(is.na(issn.x), issn.y, issn.x)) %>%
  select(journal_title, issn, rank)

With data:

df1 <- tibble(journal_title = c(NA, 
                                "Journal of yyy",
                                "Journal of zzz",
                                "Journal of rrr"),
              issn = c(9999, 1234, NA,1001))

df2 <- tibble(journal_title = c("Journal of xxx", NA, "Journal of zzz","Journal of rrr"),
              issn = c(9999, 1234, 8888, NA),
              rank = c(1,2,3,4))


> df1
# A tibble: 4 x 2
  journal_title   issn
  <chr>          <dbl>
1 NA              9999
2 Journal of yyy  1234
3 Journal of zzz    NA
4 Journal of rrr  1001
> df2
# A tibble: 4 x 3
  journal_title   issn  rank
  <chr>          <dbl> <dbl>
1 Journal of xxx  9999     1
2 NA              1234     2
3 Journal of zzz  8888     3
4 Journal of rrr    NA     4

CodePudding user response:

It seems you cannot combine multiple join conditions with OR to use journal_title as backup join key. Instead of two left_joins with intermediate coalesce as workaround you could do a full_join and reduce your result set afterwards by grouping over issn.

df1 |> 
  dplyr::full_join(df2, by=c("issn", "journal_title")) |> 
  dplyr::group_by(issn) |> 
  dplyr::summarise(
    rank = na.omit(rank),
    journal_title = na.omit(journal_title)
    )


# A tibble: 3 × 3
# Groups:   issn [3]
   issn  rank journal_title 
  <dbl> <dbl> <chr>         
1  1234     2 Journal of yyy
2  8888     3 Journal of zzz
3  9999     1 Journal of xxx

It works with your minimal example. If you have lots of data this might be a ineffecient way as your intermediate result is a lot bigger than the output.

CodePudding user response:

Another option with bind_rows():

bind_rows(df1, df2) |>
  group_by(issn) |> 
  summarise(
    journal_title = na.omit(journal_title),
    rank          = na.omit(rank)
  )
#    issn journal_title   rank
#   <dbl> <chr>          <dbl>
# 1  1234 Journal of yyy     2
# 2  8888 Journal of zzz     3
# 3  9999 Journal of xxx     1
  • Related