I have a list of keywords for finding text in a group of PDF files. Some of the keywords must appear together in the text for it to be extracted, even if they are not adjacent.
I used the pdfsearch library, and it finds text containing each keyword separately. I read the documentation, but I cannot find a way to combine the keywords.
My code is shown below:
library(pdftools)
library(pdfsearch)
keywords <- c("LOTE", "VOLUMEN",
              "LOTE", "SOLVENCIA",
              "LOTE", "SEGURO",
              "VOLUMEN", "TRES ÚLTIMOS",
              "VOLUMEN", "3 ÚLTIMOS",
              "VOLUMEN", "(3) ÚLTIMOS",
              "NO", "APLICA", "SOLVENCIA")
Results <- keyword_directory(directory,
                             keyword = keywords,
                             surround_lines = 1, full_names = TRUE,
                             ignore_case = TRUE, remove_hyphen = TRUE)
In the keyword assignment, every line is a combination:
"LOTE" "SOLVENCIA",
"LOTE" "SEGURO",
"VOLUMEN" "TRES ÚLTIMOS",
"VOLUMEN" "3 ÚLTIMOS",
"VOLUMEN" "(3) ÚLTIMOS",
"NO" "APLICA" "SOLVENCIA"
For example, take the combination "NO" "APLICA" "SOLVENCIA".
This text should be extracted: "No siempre aplica el uso de solvencia para el proyecto"
This text should not be extracted, even though it contains the keyword "NO": "No pueden contar con las listas antes de tiempo"
At the moment I can only get the text where the individual keywords appear.
Answer:
I am assuming you want all keywords in a 'group' of keywords to be present in a single line, in order for you to extract that single line. If you want the keywords to be present in a single file to extract all text from that file, let me know so I can adjust the answer.
Indeed, pdfsearch::keyword_search() searches only for individual words. Luckily, it gives us a page number and a line number for each result, so we can match those and check whether all words from a single group are present in the search results on the same line.
Preparation
We start by defining our keywords grouped into vectors, and loading an example file:
library(pdfsearch)
library(dplyr)
# Our list of keywords, grouped in vectors
grouped_keywords <- list(c('saturated', 'model'),
                         c('vector', 'specification'),
                         c('framework', 'inferences'),
                         c('test', 'that', 'gives', 'no', 'results'),
                         c('population', 'degree', 'types'))
# Example file supplied with `pdfsearch`, also available at https://arxiv.org/pdf/1610.00147.pdf
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')
Step 1: search all words individually
To start the search, we perform keyword_search() on a flattened version of grouped_keywords. This will yield all results we want, but also many results we don't want (lines that contain only one or a few of the keywords in a group).
# Search for individual keywords
individual_results <- keyword_search(file,
                                     keyword = unlist(grouped_keywords), # combine our keyword list into a single 1-dimensional vector
                                     path = TRUE)
cat(nrow(individual_results), 'results for individual words\n')
head(individual_results, n=3)
Result:
367 results for individual words
# A tibble: 3 × 5
  keyword   page_num line_num line_text token_text
  <chr>        <int>    <int> <list>    <list>
1 saturated        5      112 <chr [1]> <list [1]>
2 saturated        5      114 <chr [1]> <list [1]>
3 saturated        5      119 <chr [1]> <list [1]>
Step 2: Merge results for keywords in the same subgroup
For each group of keywords, we look for results that share the same page number and line number, and that together match all keywords in the group:
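combined_results <- lapply(grouped_keywords, \(keyword_group) {
  individual_results %>%
    # Keep only the hits for keywords that belong to this group
    filter(keyword %in% keyword_group) %>%
    # Treat each (page, line) pair as one candidate line
    group_by(page_num, line_num) %>%
    # Keep the line only if every keyword of the group was found on it
    filter(length(unique(keyword)) == length(unique(keyword_group))) %>%
    # Collapse the per-keyword rows into one row per line
    summarise(keywords = paste(keyword_group, collapse = ' '),
              line_text = line_text[1],
              token_text = token_text[1],
              .groups = "keep")
})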
# Merge list of tibbles to a single tibble
combined_results <- do.call(rbind, combined_results)
# Output result
cat(nrow(combined_results), 'results for combined words\n')
combined_results
Result:
8 results for combined words
# A tibble: 8 × 5
# Groups:   page_num, line_num [8]
  page_num line_num keywords                line_text token_text
     <int>    <int> <chr>                   <list>    <list>
1        5      112 saturated model         <chr [1]> <list [1]>
2        5      114 saturated model         <chr [1]> <list [1]>
3        5      119 saturated model         <chr [1]> <list [1]>
4        7      184 saturated model         <chr [1]> <list [1]>
5        5      124 vector specification    <chr [1]> <list [1]>
6        2       32 framework inferences    <chr [1]> <list [1]>
7        7      168 framework inferences    <chr [1]> <list [1]>
8        7      187 population degree types <chr [1]> <list [1]>
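To check the "all keywords of a group on the same line" rule against the two example sentences from the question, here is a minimal base R sketch that works on plain character strings rather than on pdfsearch output. The names keyword_groups, lines and line_has_all() are made up for this illustration and are not part of pdfsearch. It assumes case-insensitive, literal substring matching, so "NO" would also match inside a longer word.

```r
# Two of the keyword groups from the question; every word in a group
# must occur on the same line for that line to be kept
keyword_groups <- list(
  c("LOTE", "SOLVENCIA"),
  c("NO", "APLICA", "SOLVENCIA")
)

# The two example sentences from the question
lines <- c(
  "No siempre aplica el uso de solvencia para el proyecto",
  "No pueden contar con las listas antes de tiempo"
)

# Hypothetical helper: TRUE when every keyword of `group` occurs in `line`.
# fixed = TRUE treats keywords literally, so "(3) ÚLTIMOS" is not parsed
# as a regular expression; toupper() makes the comparison case-insensitive.
line_has_all <- function(line, group) {
  all(vapply(group,
             function(kw) grepl(toupper(kw), toupper(line), fixed = TRUE),
             logical(1)))
}

# A line is kept when at least one group matches it completely
matches <- vapply(
  lines,
  function(l) any(vapply(keyword_groups, line_has_all, logical(1), line = l)),
  logical(1),
  USE.NAMES = FALSE
)

lines[matches]
```

Only the first sentence is kept, because the second one contains "NO" but neither "APLICA" nor "SOLVENCIA". If whole-word matching is needed, the grepl() call would have to be replaced with a regular expression that uses word boundaries.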