Home > Software design >  searching a string of text in R producing crossable
searching a string of text in R producing crossable

Time:09-09

I have a data frame where columns have strings of text. I want to search within the text of multiple columns and count the occurrence of specific words, and then make a table that reports the co-occurence (by multiple columns).

Here's a small data frame to make this reproducible:

#create dataframe


TI <- c('Studies of pancreatic cancer',
        'colon, breast, and pancreatic cancer',
        'cancer and CVD paper', 
        'CVD paper'
        )
AB <- c('Autoimmune pancreatitis (AIP) is now considered a disease to be treated by diet.',
        'dietary patterns, positive associations were found between pancreatic cancer risk ',
        'Cardiovascular diseases (CVD) is linked to Diet',
        'making this match with pancreas'
        )

df <- data.frame(TI, AB)

#TI = title (of a paper)
#AB = abstract of the same paper

I then want to see how many titles have either the word 'cancer' or 'CVD' in the title (TI column), and the co-occurrence with the words 'pancreatic (or some variation)' or 'diet' in the abstract (AB column)

I can count the single occurence of the words I want with grep

pancreatic.ab <- length(grep("pancreat*", df$AB, ignore.case = TRUE, perl = FALSE))
pancreatic.ab

diet.ab <- length(grep("diet*", df$AB, ignore.case = TRUE, perl = FALSE))
diet.ab

cancer.ti <- length(grep("cancer*", df$TI, ignore.case = TRUE, perl = FALSE))
cancer.ti

CVD.ti <- length(grep("CVD", df$TI, ignore.case = TRUE, perl = FALSE))
CVD.ti

but not sure how to do this for a complicated cross tabs table.

any suggestions?

An example of desired output would be something like this example table

CodePudding user response:

You can bind the rows that have your targets in the TI columns, providing a label for the columns belonging to each (in my example ti). Then, grouping by that label (i.e. group_by(ti)), you can create the columns with the counts, using summarize()

bind_rows(
  df %>% filter(grepl("cancer*", TI, ignore.case=T)) %>% mutate(ti="CANCER"),
  df %>% filter(grepl("CVD*", TI, ignore.case=T)) %>% mutate(ti="CVD")
) %>% 
  group_by(ti) %>% 
  summarize(
    TotalAB = n(),
    Pancreatic = sum(grepl("pancreat*", AB, ignore.case=T)),
    Diet = sum(grepl("diet*", AB, ignore.case=T))
    )

Output:

  ti     TotalAB Pancreatic  Diet
  <chr>    <int>      <int> <int>
1 CANCER       3          2     3
2 CVD          2          1     1
  • Related