Find overlap in terms between a pair of documents

Time:11-21

I have a sparse term-document matrix produced by tm's TermDocumentMatrix.

I am trying to write a function that takes two document names and k as its arguments, finds all terms that occur in both documents, sorts that list in descending order by the word count of the term, and returns the top k. Words in each term are separated by underscores (like bob_raids_crops).

Here's a toy example (where I sort by character length rather than by the number of words in each term):

library(tm)
library(dplyr)
library(stringr)  # provides str_length(), used below
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

df  <- data.frame(term = row.names(tdm), as.matrix(tdm[, c("127", "144")]), row.names = NULL)
df$in.both <- df[, 2] > 0 & df[, 3] > 0
df <- df%>%
  subset(in.both == TRUE) %>%
  arrange(desc(str_length(term))) %>%
  select(term) %>%
  top_n(5,str_length(term))
df

This returns:

       term
1 companies
2   markets
3    market
4    prices
5    reuter

I was going to write a function, but am wondering if there is an existing way to do this. If not, can I make the above more efficient (like avoiding data frames)?

Answer:

Here is a solution that uses rowSums to get each term's count in a document and full_join to combine the two per-document data frames. Zero counts are filtered out first, so after pivot_wider a term missing from either document has an NA in that column; na.omit() then drops those rows, leaving only terms that occur in both documents. The result is sorted in descending order of each term's count in document 144.

library(tm)
#> Loading required package: NLP
#library(dplyr)
library(tidyverse)
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

one_44 <- rowSums(as.matrix(tdm[, "144"])) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename("F" = ".") %>% 
  mutate(text = "one_44")
one_27 <- rowSums(as.matrix(tdm[, "127"])) %>% 
  as.data.frame() %>% 
  rownames_to_column() %>% 
  rename("F" = ".") %>% 
  mutate(text = "one_27")

one_27 %>% full_join(one_44, by = c('rowname', "F", 'text')) %>% 
  filter(F >0) %>% #distinct(text)
  pivot_wider(names_from = text, values_from = F) %>% 
  na.omit() %>% 
  arrange(desc(one_44))
#> # A tibble: 10 × 3
#>    rowname   one_27 one_44
#>    <chr>      <dbl>  <dbl>
#>  1 oil            5     12
#>  2 said           3     11
#>  3 prices         3      5
#>  4 market         1      3
#>  5 markets        1      2
#>  6 companies      1      1
#>  7 last           1      1
#>  8 price          2      1
#>  9 reuter         1      1
#> 10 two            1      1
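
For the original goal (the top k shared terms ranked by how many underscore-separated words each term contains, without building data frames), a base R sketch might look like the following. The function name top_shared_terms and the toy matrix are hypothetical, made up for illustration; the function assumes a plain count matrix with terms as row names and document names as column names, which for a tm object could be obtained with as.matrix(tdm[, c("127", "144")]) as in the question.

```r
# Hypothetical helper (not part of tm): top k terms shared by two documents,
# ranked by the number of underscore-separated words in each term.
top_shared_terms <- function(m, doc_a, doc_b, k) {
  # m: terms x documents count matrix with row and column names
  counts <- m[, c(doc_a, doc_b), drop = FALSE]
  # keep only terms with a positive count in both documents
  shared <- rownames(counts)[counts[, 1] > 0 & counts[, 2] > 0]
  # words per term = number of pieces after splitting on "_"
  n_words <- lengths(strsplit(shared, "_", fixed = TRUE))
  # sort by descending word count; ties keep their original row order
  head(shared[order(-n_words)], k)
}

# Toy data (made up for illustration), same layout as as.matrix(tdm)
m <- matrix(c(1, 2,
              3, 0,
              1, 1,
              0, 4,
              2, 5),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("oil", "bob_raids_crops", "crude_oil",
                              "opec", "oil_price_drop"),
                            c("127", "144")))

top_shared_terms(m, "127", "144", 2)
#> [1] "oil_price_drop" "crude_oil"
```

Converting only the two requested columns to a dense matrix keeps memory use small even when the full term-document matrix is large and sparse.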