Search combined keywords in a PDF with R

I have a list of keywords that I use to find text in a group of PDF files. Some of the keywords must appear in combination for the text to be extracted, even if they are not next to each other.

I used the pdfsearch library, and it finds text matching each keyword separately. I read the documentation, but I cannot find a way to combine the keywords.

My code is shown below:


    library(pdftools)
    library(pdfsearch)
    
    keywords <- c("LOTE","VOLUMEN",
    "LOTE","SOLVENCIA",
    "LOTE","SEGURO",
    "VOLUMEN","TRES ÚLTIMOS",
    "VOLUMEN","3 ÚLTIMOS",
    "VOLUMEN","(3) ÚLTIMOS",
    "NO", "APLICA", "SOLVENCIA")
    
    Results <- keyword_directory(directory,
                                 keyword = keywords,
                                 surround_lines = 1, full_names = TRUE,
                                 ignore_case = TRUE, remove_hyphen = TRUE)

In the keyword assignment, each line is one combination:

"LOTE"   "SOLVENCIA",
"LOTE"   "SEGURO",
"VOLUMEN"   "TRES ÚLTIMOS",
"VOLUMEN"  "3 ÚLTIMOS",
"VOLUMEN"   "(3) ÚLTIMOS",
"NO"   "APLICA"   "SOLVENCIA"

For example, take the combination "NO" "APLICA" "SOLVENCIA".

This text should be extracted: "No siempre aplica el uso de solvencia para el proyecto"

This text should not be extracted, even though it contains the keyword "NO": "No pueden contar con las listas antes de tiempo"

At the moment I can only get the text where the individual keywords appear.

Answer:

I am assuming you want all keywords in a group to be present on a single line, so that you can extract that line. If you instead want to extract all text from a file whenever all keywords of a group appear somewhere in that file, let me know so I can adjust the answer.

Indeed, pdfsearch::keyword_search() only searches for each keyword individually. Luckily, it returns a page number and a line number for each result, so we can match those and check whether all words from a single group appear in the search results for the same line.
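The "all keywords on one line" rule can be checked on the two sentences from the question before touching any PDF. This is only a sketch in base R: it uses a literal, case-insensitive substring test via grepl(), which is an assumption and not necessarily how pdfsearch matches internally, and the names `text_lines`, `matches_all` and `hits` are hypothetical.

```r
# The two example sentences from the question, standing in for lines of PDF text
text_lines <- c("No siempre aplica el uso de solvencia para el proyecto",
                "No pueden contar con las listas antes de tiempo")

# One keyword group: every keyword must occur somewhere in the line
keyword_group <- c("NO", "APLICA", "SOLVENCIA")

# TRUE only if every keyword occurs in the line
# (assumption: literal substring match, case-insensitive)
matches_all <- function(line, keywords) {
  all(vapply(keywords,
             function(k) grepl(tolower(k), tolower(line), fixed = TRUE),
             logical(1)))
}

hits <- vapply(text_lines, matches_all, logical(1),
               keywords = keyword_group, USE.NAMES = FALSE)

# Keep only the lines on which the whole group was found
text_lines[hits]
```

Only the first sentence is kept, because the second one contains "NO" but neither "APLICA" nor "SOLVENCIA". Note that a substring test like this also accepts "no" inside a longer word such as "conocer"; a word-boundary regular expression would be stricter.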

Preparation

We start by defining our keywords grouped into vectors, and loading an example file:

    library(pdfsearch)
    library(dplyr)

    # Our list of keywords, grouped in vectors
    grouped_keywords <- list(c('saturated', 'model'),
                             c('vector', 'specification'),
                             c('framework', 'inferences'),
                             c('test', 'that', 'gives', 'no', 'results'),
                             c('population', 'degree', 'types'))

    # Example file supplied with `pdfsearch`, also available at https://arxiv.org/pdf/1610.00147.pdf
    file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

Step 1: Search all words individually

To start the search, we perform keyword_search() on a flattened version of grouped_keywords. This will yield all results we want, but also many results we don't want (lines that only contain one or a few of the keywords in a group).

    # Search for individual keywords
    individual_results <- keyword_search(file,
                                         # flatten the keyword list into one vector; unique() avoids
                                         # searching twice for a keyword shared by several groups
                                         keyword = unique(unlist(grouped_keywords)),
                                         path = TRUE)

    cat(nrow(individual_results), 'results for individual words\n')
    head(individual_results, n = 3)

Result:

    367 results for individual words

    # A tibble: 3 × 5
      keyword   page_num line_num line_text token_text
      <chr>        <int>    <int> <list>    <list>    
    1 saturated        5      112 <chr [1]> <list [1]>
    2 saturated        5      114 <chr [1]> <list [1]>
    3 saturated        5      119 <chr [1]> <list [1]>

Step 2: Merge results for keywords in the same subgroup

For each group of keywords, we look for results that have the same page number and the same line number, and that together match all keywords in the group:

    combined_results <- lapply(grouped_keywords, \(keyword_group) {
      individual_results %>%
        # keep only hits for keywords that belong to this group
        filter(keyword %in% keyword_group) %>%
        group_by(page_num, line_num) %>%
        # keep a line only if every keyword of the group was found on it
        filter(length(unique(keyword)) == length(unique(keyword_group))) %>%
        # collapse the hits for one line into a single row
        summarise(keywords = paste(keyword_group, collapse = '   '),
                  line_text = line_text[1],
                  token_text = token_text[1],
                  .groups = "keep")
    })

    # Merge list of tibbles to a single tibble
    combined_results <- do.call(rbind, combined_results)

    # Output result
    cat(nrow(combined_results), 'results for combined words\n')
    combined_results

Result:

    8 results for combined words

    # A tibble: 8 × 5
    # Groups:   page_num, line_num [8]
      page_num line_num keywords                    line_text token_text
         <int>    <int> <chr>                       <list>    <list>    
    1        5      112 saturated   model           <chr [1]> <list [1]>
    2        5      114 saturated   model           <chr [1]> <list [1]>
    3        5      119 saturated   model           <chr [1]> <list [1]>
    4        7      184 saturated   model           <chr [1]> <list [1]>
    5        5      124 vector   specification      <chr [1]> <list [1]>
    6        2       32 framework   inferences      <chr [1]> <list [1]>
    7        7      168 framework   inferences      <chr [1]> <list [1]>
    8        7      187 population   degree   types <chr [1]> <list [1]>
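
The same per-line check can be tried with some of the keyword groups from the question. What follows is only a sketch in base R, without dplyr: `toy_results` is an invented data frame that imitates the `keyword`, `page_num` and `line_num` columns of the search output (it is not real output from any PDF), and `lines_with_all` is a hypothetical helper name.

```r
# Some keyword groups from the question; each vector is one combination
grouped_keywords <- list(c("LOTE", "VOLUMEN"),
                         c("LOTE", "SOLVENCIA"),
                         c("NO", "APLICA", "SOLVENCIA"))

# Invented rows shaped like the search output (keyword, page_num, line_num)
toy_results <- data.frame(
  keyword  = c("NO", "APLICA", "SOLVENCIA", "NO", "LOTE", "SOLVENCIA"),
  page_num = c(1L, 1L, 1L, 1L, 2L, 2L),
  line_num = c(10L, 10L, 10L, 11L, 40L, 40L),
  stringsAsFactors = FALSE
)

# For one group, return the page/line pairs on which every keyword was found
lines_with_all <- function(results, keyword_group) {
  # One row per distinct keyword per page/line, restricted to this group
  sub <- unique(results[results$keyword %in% keyword_group,
                        c("keyword", "page_num", "line_num")])
  if (nrow(sub) == 0) return(NULL)
  key <- paste(sub$page_num, sub$line_num)
  # Number of distinct group keywords found on each page/line
  n_found <- ave(seq_along(key), key, FUN = length)
  keep <- unique(sub[n_found == length(unique(keyword_group)),
                     c("page_num", "line_num")])
  if (nrow(keep) == 0) return(NULL)
  keep$keywords <- paste(keyword_group, collapse = "   ")
  keep
}

# Groups without a complete match return NULL and are dropped by rbind()
combined <- do.call(rbind,
                    lapply(grouped_keywords, lines_with_all, results = toy_results))
combined
```

In this toy data, page 1 line 11 contains only "NO" and is dropped, while page 1 line 10 and page 2 line 40 each satisfy one complete group. With real search results covering several files, the file identifier would presumably have to become part of the key as well, which this sketch does not show.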