Home > Net >  Several columns matching over 2 data.frames
Several columns matching over 2 data.frames

Time:08-05

I have two large, messy datasets. The structure of the variables for this issue roughly looks like this:

First, a dataset containing some IDs.

DF1 <- tibble(
  ID1 = c(1, 2, 3), 
  ID2 = c(4, 5, NA),
  ID3 = c(9, NA, NA)
)
    
> DF1
# A tibble: 3 × 3
    ID1   ID2   ID3
  <dbl> <dbl> <dbl>
1     1     4     9
2     2     5    NA
3     3    NA    NA

Secondly, a dataset containing IDs as well as titles connected to those IDs

DF2 <- tibble(
  ID1_2 = c(2, 4, 1), 
  ID2_2 = c(5, NA, NA), 
  ID3_2 = c(10, NA, NA), 
  Title = c("Journal X", "Journal Y", "Journal Z")
)

> DF2
# A tibble: 3 × 4
  ID1_2 ID2_2 ID3_2 Title    
  <dbl> <dbl> <dbl> <chr>    
1     2     5    10 Journal X
2     4    NA    NA Journal Y
3     1    NA    NA Journal Z

The IDs I'm working with might have several titles connected to them, and the titles might have several IDs connected as well.

What I am looking to do is to obtain a column containing titles (from Title in DF2) based upon whether any IDs in the columns ID1, ID2 or ID3 matches any of the IDs in ID1_2, ID2_2 or ID3_2. Thus, if any of the IDs in a row of DF1 matches any ID in DF2, the value of Match_titles should be the value of Title in DF2. However, there might me several matches, why this value should state each match, separated by a comma. To illustrate, the goal data should look like the following:

DF3 <- tibble(
  ID1 = c(1, 2, 3), 
  ID2 = c(4, 5, NA),
  ID3 = c(9, NA, NA), 
  Match_titles = c("Journal Z, Journal Y", 
                   "Journal X, Journal X" ,
                   NA)
)

> DF3
# A tibble: 3 × 4
    ID1   ID2   ID3 Match_titles        
  <dbl> <dbl> <dbl> <chr>               
1     1     4     9 Journal Z, Journal Y
2     2     5    NA Journal X, Journal X
3     3    NA    NA NA                  

I find it a bit challenging to explain, but I hope it is clear - any help is absolutely helpful :)

CodePudding user response:

First let's pivot DF2 longer:

df2_longer <- DF2 %>% pivot_longer(cols = c("ID1_2", "ID2_2", "ID3_2"))
# A tibble: 9 x 3
  Title     name  value
  <chr>     <chr> <dbl>
1 Journal X ID1_2     2
2 Journal X ID2_2     5
3 Journal X ID3_2    10
4 Journal Y ID1_2     4
5 Journal Y ID2_2    NA
6 Journal Y ID3_2    NA
7 Journal Z ID1_2     1
8 Journal Z ID2_2    NA
9 Journal Z ID3_2    NA

This then can be more easily done by:

res <- DF1 %>% group_by(ID1, ID2, ID3) %>%  
mutate(match_list = list(df2_longer$Title[df2_longer$value %in% c(ID1, ID2, ID3)[!is.na(c(ID1, ID2, ID3))]]) )

res$match_list

> res$match_list
[[1]]
[1] "Journal Y" "Journal Z"

[[2]]
[1] "Journal X" "Journal X"

[[3]]
character(0)

As you can see Match_list is a list of matching elements which we can now paste together.

res <- res %>% mutate(Match_titles = paste(unlist(match_list), collapse = ", "))
res

# A tibble: 3 x 5
# Groups:   ID1, ID2, ID3 [3]
    ID1   ID2   ID3 match_list Match_titles          
  <dbl> <dbl> <dbl> <list>     <chr>                 
1     1     4     9 <chr [2]>  "Journal Y, Journal Z"
2     2     5    NA <chr [2]>  "Journal X, Journal X"
3     3    NA    NA <chr [0]>  ""    
  • Related