searching a string of text in R producing crossable-CodePudding

I have a data frame where columns have strings of text. I want to search within the text of multiple columns and count the occurrence of specific words, and then make a table that reports the co-occurence (by multiple columns).

Here's a small data frame to make this reproducible:

#create dataframe


TI <- c('Studies of pancreatic cancer',
        'colon, breast, and pancreatic cancer',
        'cancer and CVD paper', 
        'CVD paper'
        )
AB <- c('Autoimmune pancreatitis (AIP) is now considered a disease to be treated by diet.',
        'dietary patterns, positive associations were found between pancreatic cancer risk ',
        'Cardiovascular diseases (CVD) is linked to Diet',
        'making this match with pancreas'
        )

df <- data.frame(TI, AB)

#TI = title (of a paper)
#AB = abstract of the same paper

I then want to see how many titles have either the word 'cancer' or 'CVD' in the title (TI column), and the co-occurrence with the words 'pancreatic (or some variation)' or 'diet' in the abstract (AB column)

I can count the single occurence of the words I want with grep

pancreatic.ab <- length(grep("pancreat*", df$AB, ignore.case = TRUE, perl = FALSE))
pancreatic.ab

diet.ab <- length(grep("diet*", df$AB, ignore.case = TRUE, perl = FALSE))
diet.ab

cancer.ti <- length(grep("cancer*", df$TI, ignore.case = TRUE, perl = FALSE))
cancer.ti

CVD.ti <- length(grep("CVD", df$TI, ignore.case = TRUE, perl = FALSE))
CVD.ti

but not sure how to do this for a complicated cross tabs table.

any suggestions?

An example of desired output would be something like this

CodePudding user response：

You can bind the rows that have your targets in the TI columns, providing a label for the columns belonging to each (in my example ti). Then, grouping by that label (i.e. group_by(ti)), you can create the columns with the counts, using summarize()

bind_rows(
  df %>% filter(grepl("cancer*", TI, ignore.case=T)) %>% mutate(ti="CANCER"),
  df %>% filter(grepl("CVD*", TI, ignore.case=T)) %>% mutate(ti="CVD")
) %>% 
  group_by(ti) %>% 
  summarize(
    TotalAB = n(),
    Pancreatic = sum(grepl("pancreat*", AB, ignore.case=T)),
    Diet = sum(grepl("diet*", AB, ignore.case=T))
    )

Output:

  ti     TotalAB Pancreatic  Diet
  <chr>    <int>      <int> <int>
1 CANCER       3          2     3
2 CVD          2          1     1