Concatenate all rows of a column and check if a string contains any of its words

I have a dataset with citations and authors by group:

Group	Citations	Authors
1	das baker	evans jumper
1	remmert biegert hauser	wang bryson
2	morcos pagnani	baker
2	mcguffin bryson jones	trinu

For each group, I would like to check whether any (and if so, how many) of the names in the "Authors" column of the other groups are contained in its "Citations" column. For instance, the author "baker" from group 2 appears in the "Citations" column of group 1, in row 1.

I think if I could obtain a dataframe like this, I would be able to answer the question:

Group	Citations	Authors_all_except_focal	Present	Occurrences
1	das baker	baker trinu	1	1
1	remmert biegert hauser	baker trinu	0	0
2	morcos pagnani	evans jumper wang bryson	0	0
2	mcguffin bryson jones	evans jumper wang bryson	1	1

I was thinking of concatenating the "Authors" column into one string that excludes the authors of the focal group and then using str_detect, but I am having trouble constructing this dataset (I tried colSums without success, apparently because it does not accept strings).

CodePudding user response:

Try this; it may be part of the solution:

library(dplyr)

# Example data from the question
Group     <- c(1, 1, 2, 2)
Citations <- c("das baker", "remmert biegert hauser", "morcos pagnani", "mcguffin bryson jones")
Authors   <- c("evans jumper", "wang bryson", "baker", "trinu")

df <- data.frame(Group = Group, Citations = Citations, Authors = Authors)

df_authors <- df %>%
  # every author in the data, as one space-separated string
  mutate(all_authors = paste(Authors, collapse = " ")) %>%
  group_by(Group) %>%
  # the authors of the focal group only
  mutate(Author_per_group = paste(Authors, collapse = " ")) %>%
  ungroup() %>%
  rowwise() %>%
  # remove the focal group's authors; fixed = TRUE matches the text literally
  # (this assumes the rows of each group are adjacent in df)
  mutate(Authors_all_except_focal = trimws(gsub(Author_per_group, "", all_authors, fixed = TRUE))) %>%
  select(-c(all_authors, Author_per_group)) %>%
  # count the distinct words shared by the two columns
  mutate(present_authors = length(intersect(
    unlist(strsplit(Authors_all_except_focal, " ")),
    unlist(strsplit(Citations, " "))
  )))

# df_authors
# # A tibble: 4 x 5
# # Rowwise: 
# Group Citations              Authors      Authors_all_except_focal present_authors
# <dbl> <chr>                  <chr>        <chr>                              <int>
# 1     1 das baker              evans jumper baker trinu                            1
# 2     1 remmert biegert hauser wang bryson  baker trinu                            0
# 3     2 morcos pagnani         baker        evans jumper wang bryson               0
# 4     2 mcguffin bryson jones  trinu        evans jumper wang bryson               1
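
One caveat: the gsub() step above only removes the focal group's authors when they form one contiguous piece of all_authors, so it depends on each group's rows being adjacent. As a minimal sketch of an alternative (base R only; the helper name authors_except_focal is made up for this example), the string can be built directly from the rows of the other groups:

```r
# Example data from the question
df <- data.frame(
  Group     = c(1, 1, 2, 2),
  Citations = c("das baker", "remmert biegert hauser",
                "morcos pagnani", "mcguffin bryson jones"),
  Authors   = c("evans jumper", "wang bryson", "baker", "trinu"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, paste together the Authors entries
# of every row that belongs to a different group
authors_except_focal <- function(group, authors) {
  vapply(
    group,
    function(g) paste(authors[group != g], collapse = " "),
    character(1)
  )
}

df$Authors_all_except_focal <- authors_except_focal(df$Group, df$Authors)
df$Authors_all_except_focal
# "baker trinu" "baker trinu" "evans jumper wang bryson" "evans jumper wang bryson"
```

Because it filters on Group rather than on text, this should not depend on row order or on special characters in the names.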

Please give more details or examples if you need more than this. For example, 'Occurrences' could be counted per author or per group, so how should it behave when several authors appear in several groups?
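
In case it helps, here is one possible reading of "Present" and "Occurrences", sketched in base R. It assumes that names are single words separated by whitespace and that Occurrences means the number of words in a row's Citations that match an author name from any other group. The helper words() and the table authors_long are names invented for this sketch.

```r
# Example data from the question
df <- data.frame(
  Group     = c(1, 1, 2, 2),
  Citations = c("das baker", "remmert biegert hauser",
                "morcos pagnani", "mcguffin bryson jones"),
  Authors   = c("evans jumper", "wang bryson", "baker", "trinu"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string into its whitespace-separated words
words <- function(x) strsplit(trimws(x), "\\s+")

# Long table with one row per (group, author name)
author_words <- words(df$Authors)
authors_long <- data.frame(
  Group  = rep(df$Group, lengths(author_words)),
  Author = unlist(author_words),
  stringsAsFactors = FALSE
)

# For each citation row, count the citation words that match an author
# name belonging to a different group (assumed meaning of Occurrences)
df$Occurrences <- mapply(
  function(g, cit) {
    others <- unique(authors_long$Author[authors_long$Group != g])
    sum(cit %in% others)
  },
  df$Group, words(df$Citations)
)
df$Present <- as.integer(df$Occurrences > 0)

# Total per group
totals <- aggregate(Occurrences ~ Group, data = df, FUN = sum)
```

On the example data this gives Occurrences of 1, 0, 0, 1 for the four rows, and a total of 1 for each group. If you want per-author counts instead, authors_long already holds one row per name, so you could count matches against it author by author.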