I have a dataset with citations and authors by group:
Group | Citations | Authors |
---|---|---|
1 | das baker | evans jumper |
1 | remmert biegert hauser | wang bryson |
2 | morcos pagnani | baker |
2 | mcguffin bryson jones | trinu |
For each group, I would like to check whether any (and if so, how many) of the names in the "Authors" column of the other groups appear in its "Citations" column. For instance, the author "baker" from group 2 appears in the Citations column of group 1, in row 1.
I think that if I could obtain a dataframe like this one, I would be able to answer the question:
Group | Citations | Authors_all_except_focal | Present | Occurrences |
---|---|---|---|---|
1 | das baker | baker trinu | 1 | 1 |
1 | remmert biegert hauser | baker trinu | 0 | 0 |
2 | morcos pagnani | evans jumper wang bryson | 0 | 0 |
2 | mcguffin bryson jones | evans jumper wang bryson | 1 | 1 |
I was thinking of concatenating the Authors column into one string that excludes the authors of the focal group and then using str_detect, but I am having trouble constructing this dataset (I tried colSums without success, apparently because it does not work on strings).
CodePudding user response:
Try this; it may be part of a solution:
library(dplyr)

# Example data
Group <- c(1, 1, 2, 2)
Citations <- c("das baker", "remmert biegert hauser", "morcos pagnani", "mcguffin bryson")
Authors <- c("evans jumper", "wang bryson", "baker", "trinu")
df <- data.frame(Group = Group, Citations = Citations, Authors = Authors)

df_authors <- df %>%
  # all authors across every group, as one space-separated string
  mutate(all_authors = paste(Authors, collapse = " ")) %>%
  group_by(Group) %>%
  # authors of the row's own (focal) group
  mutate(Author_per_group = paste(Authors, collapse = " ")) %>%
  ungroup() %>%
  rowwise() %>%
  # remove the focal group's authors; this relies on each group's rows being adjacent
  mutate(Authors_all_except_focal = trimws(gsub(Author_per_group, "", all_authors, fixed = TRUE))) %>%
  select(-c(all_authors, Author_per_group)) %>%
  mutate(present_authors =
    # count the distinct words shared by the two columns
    length(intersect(unlist(strsplit(Authors_all_except_focal, " ")),
                     unlist(strsplit(Citations, " "))))
  )
# df_authors
# # A tibble: 4 x 5
# # Rowwise:
#   Group Citations              Authors      Authors_all_except_focal present_authors
#   <dbl> <chr>                  <chr>        <chr>                              <int>
# 1     1 das baker              evans jumper baker trinu                            1
# 2     1 remmert biegert hauser wang bryson  baker trinu                            0
# 3     2 morcos pagnani         baker        evans jumper wang bryson               0
# 4     2 mcguffin bryson        trinu        evans jumper wang bryson               1
Please give more details or examples if you need more than this. For example, 'occurrences' could be counted per author or per group, so how do you want to proceed if many authors appear in many groups?
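In case it helps, here is a minimal base R sketch of one possible reading of the question. It assumes "Occurrences" means the number of distinct author names from other groups that appear as whole words in a row's Citations, and that "Present" is simply 1 when that count is above zero. The helper `split_names` and the intermediate list `other_authors` are names made up for this illustration, and the data are copied from the question (including "jones" in the last citation).

```r
# Example data copied from the question
df <- data.frame(
  Group     = c(1, 1, 2, 2),
  Citations = c("das baker", "remmert biegert hauser",
                "morcos pagnani", "mcguffin bryson jones"),
  Authors   = c("evans jumper", "wang bryson", "baker", "trinu"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split space-separated strings into individual names
split_names <- function(x) unlist(strsplit(trimws(x), "\\s+"))

# For each row, the unique author names found in every *other* group
other_authors <- lapply(df$Group, function(g) {
  unique(split_names(df$Authors[df$Group != g]))
})

# Collapse those names back into one string per row
df$Authors_all_except_focal <- vapply(other_authors, paste, character(1),
                                      collapse = " ")

# Assumed meaning of Occurrences: number of distinct other-group authors
# that match a whole word in the row's Citations
df$Occurrences <- mapply(
  function(cites, authors) sum(authors %in% split_names(cites)),
  df$Citations, other_authors,
  USE.NAMES = FALSE
)

# Present is 1 if at least one other-group author was found, else 0
df$Present <- as.integer(df$Occurrences > 0)

print(df)
```

Because this filters on `Group != g` instead of deleting a substring, it does not depend on the rows of a group being next to each other. Matching whole words also avoids a partial hit such as "bak" inside "baker", which a plain str_detect on the concatenated string could produce. If you want totals per group, something like `aggregate(Occurrences ~ Group, df, sum)` should do it, though whether repeated names should count once or several times depends on what you need.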