Concatenate all rows of a column and check if a string contains any of its words

Time:12-07

I have a dataset with citations and authors by group:

Group  Citations               Authors
1      das baker               evans jumper
1      remmert biegert hauser  wang bryson
2      morcos pagnani          baker
2      mcguffin bryson jones   trinu

For each group, I would like to check whether any (and if so, how many) of the names in the "Authors" column of the other groups are contained in its "Citations" column. For instance, the author "baker" from group 2 appears in the "Citations" column of group 1, in row 1.

I think that if I could obtain a data frame like the following, I would be able to answer the question:

Group  Citations               Authors_all_except_focal  Present  Occurrences
1      das baker               baker trinu               1        1
1      remmert biegert hauser  baker trinu               0        0
2      morcos pagnani          evans jumper wang bryson  0        0
2      mcguffin bryson jones   evans jumper wang bryson  1        1

I was thinking about concatenating the "Authors" column into one string, excluding the authors of the focal group, and then using str_detect. However, I am having trouble constructing this dataset. I tried colSums without success, apparently because it does not work on strings.

CodePudding user response:

Try this; it may be part of a solution:

library(dplyr)

# Example data
Group     <- c(1, 1, 2, 2)
Citations <- c("das baker", "remmert biegert hauser", "morcos pagnani", "mcguffin bryson")
Authors   <- c("evans jumper", "wang bryson", "baker", "trinu")

df <- data.frame(Group = Group, Citations = Citations, Authors = Authors)

df_authors <- df %>%
  # authors of all groups, pasted into one string
  mutate(all_authors = paste(Authors, collapse = " ")) %>%
  group_by(Group) %>%
  # authors of the focal group, pasted into one string
  mutate(Author_per_group = paste(Authors, collapse = " ")) %>%
  ungroup() %>%
  rowwise() %>%
  # remove the focal group's authors (this relies on the rows of each group being adjacent)
  mutate(Authors_all_except_focal = trimws(gsub(Author_per_group, "", all_authors, fixed = TRUE))) %>%
  select(-c(all_authors, Author_per_group)) %>%
  mutate(present_authors =
           # count the distinct words shared by the two columns
           length(intersect(unlist(strsplit(Authors_all_except_focal, " ")),
                            unlist(strsplit(Citations, " ")))))
   

 # df_authors
 # # A tibble: 4 x 5
 # # Rowwise: 
 # Group Citations              Authors      Authors_all_except_focal present_authors
 # <dbl> <chr>                  <chr>        <chr>                              <int>
 # 1     1 das baker              evans jumper baker trinu                            1
 # 2     1 remmert biegert hauser wang bryson  baker trinu                            0
 # 3     2 morcos pagnani         baker        evans jumper wang bryson               0
 # 4     2 mcguffin bryson        trinu        evans jumper wang bryson               1

Please give more details or examples if you need more than this. For instance, "occurrences" could be counted per author or per group, so how do you want to proceed if several authors are present in several groups?
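If the goal is the exact "Present" and "Occurrences" columns from the question, one possible way is to compare whole words instead of deleting a substring. The following is only a sketch in base R, under the assumption that names are single words separated by single spaces and that "Occurrences" means the number of cited names that belong to another group's authors. The helpers split_names and authors_except are made up for this illustration. The data uses the question's version of row 4, including "jones".

```r
# Example data from the question
df <- data.frame(
  Group     = c(1, 1, 2, 2),
  Citations = c("das baker", "remmert biegert hauser",
                "morcos pagnani", "mcguffin bryson jones"),
  Authors   = c("evans jumper", "wang bryson", "baker", "trinu"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split space-separated strings into individual names
split_names <- function(x) unlist(strsplit(x, " ", fixed = TRUE))

# Hypothetical helper: unique author names from every group except `g`
authors_except <- function(g) {
  unique(split_names(df$Authors[df$Group != g]))
}

# One character vector of "other group" authors per row
others <- lapply(df$Group, authors_except)

# Concatenated authors of all other groups, as in the desired output
df$Authors_all_except_focal <- vapply(others, paste, character(1), collapse = " ")

# Number of cited names in each row that are authors in another group
df$Occurrences <- mapply(
  function(cit, oth) sum(split_names(cit) %in% oth),
  df$Citations, others,
  USE.NAMES = FALSE
)

# 1 if at least one such name is cited, 0 otherwise
df$Present <- as.integer(df$Occurrences > 0)

print(df)
```

Because the comparison is done with %in% on split names, it does not depend on the rows being sorted by group, and a name such as "bak" would not match "baker", which a plain substring search with str_detect would.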
