I have two large, messy datasets. The structure of the variables for this issue roughly looks like this:
First, a dataset containing some IDs.
DF1 <- tibble(
ID1 = c(1, 2, 3),
ID2 = c(4, 5, NA),
ID3 = c(9, NA, NA)
)
> DF1
# A tibble: 3 × 3
ID1 ID2 ID3
<dbl> <dbl> <dbl>
1 1 4 9
2 2 5 NA
3 3 NA NA
Secondly, a dataset containing IDs as well as titles connected to those IDs
DF2 <- tibble(
ID1_2 = c(2, 4, 1),
ID2_2 = c(5, NA, NA),
ID3_2 = c(10, NA, NA),
Title = c("Journal X", "Journal Y", "Journal Z")
)
> DF2
# A tibble: 3 × 4
ID1_2 ID2_2 ID3_2 Title
<dbl> <dbl> <dbl> <chr>
1 2 5 10 Journal X
2 4 NA NA Journal Y
3 1 NA NA Journal Z
The IDs I'm working with might have several titles connected to them, and the titles might have several IDs connected as well.
What I am looking to do is to obtain a column containing titles (from Title
in DF2
) based upon whether any IDs in the columns ID1
, ID2
or ID3
matches any of the IDs in ID1_2
, ID2_2
or ID3_2
. Thus, if any of the IDs in a row of DF1
matches any ID in DF2
, the value of Match_titles should be the value of Title
in DF2
. However, there might me several matches, why this value should state each match, separated by a comma. To illustrate, the goal data should look like the following:
DF3 <- tibble(
ID1 = c(1, 2, 3),
ID2 = c(4, 5, NA),
ID3 = c(9, NA, NA),
Match_titles = c("Journal Z, Journal Y",
"Journal X, Journal X" ,
NA)
)
> DF3
# A tibble: 3 × 4
ID1 ID2 ID3 Match_titles
<dbl> <dbl> <dbl> <chr>
1 1 4 9 Journal Z, Journal Y
2 2 5 NA Journal X, Journal X
3 3 NA NA NA
I find it a bit challenging to explain, but I hope it is clear - any help is absolutely helpful :)
CodePudding user response:
First let's pivot DF2
longer:
df2_longer <- DF2 %>% pivot_longer(cols = c("ID1_2", "ID2_2", "ID3_2"))
# A tibble: 9 x 3
Title name value
<chr> <chr> <dbl>
1 Journal X ID1_2 2
2 Journal X ID2_2 5
3 Journal X ID3_2 10
4 Journal Y ID1_2 4
5 Journal Y ID2_2 NA
6 Journal Y ID3_2 NA
7 Journal Z ID1_2 1
8 Journal Z ID2_2 NA
9 Journal Z ID3_2 NA
This then can be more easily done by:
res <- DF1 %>% group_by(ID1, ID2, ID3) %>%
mutate(match_list = list(df2_longer$Title[df2_longer$value %in% c(ID1, ID2, ID3)[!is.na(c(ID1, ID2, ID3))]]) )
res$match_list
> res$match_list
[[1]]
[1] "Journal Y" "Journal Z"
[[2]]
[1] "Journal X" "Journal X"
[[3]]
character(0)
As you can see Match_list
is a list
of matching elements which we can now paste together.
res <- res %>% mutate(Match_titles = paste(unlist(match_list), collapse = ", "))
res
# A tibble: 3 x 5
# Groups: ID1, ID2, ID3 [3]
ID1 ID2 ID3 match_list Match_titles
<dbl> <dbl> <dbl> <list> <chr>
1 1 4 9 <chr [2]> "Journal Y, Journal Z"
2 2 5 NA <chr [2]> "Journal X, Journal X"
3 3 NA NA <chr [0]> ""