I am trying to count a series of words extracted from a PDF, but I only get 0 for every word, which is not correct.
total_number_of_keywords = 0
pdf_file = "CapitalCorp.pdf"
tables=[]
words = ['blank','warrant ','offering','combination ','SPAC','founders']
count={} # is a dictionary data structure in Python
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i,pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        for elem in words:
            count[elem] = 0
        for line in f'{i} --- {tbl}' :
            elements = line.split()
            for word in words:
                count[word] = count[word] + elements.count(word)
print (count)
Answer:
This will do the job:
import pdfplumber

pdf_file = "CapitalCorp.pdf"
words = ['blank','warrant ','offering','combination ','SPAC','founders']

# Get the text of every page as one string
text = ''
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # extract_text() may give None for a page without text,
        # depending on the pdfplumber version, so fall back to ''
        text = text + '\n' + (page.extract_text() or '')

# Set up the count dictionary
count = {}
for elem in words:
    count[elem] = 0

# Count occurrences
for word in words:
    count[word] = text.count(word)

print(count)
First, you store the content of the PDF in the variable text, which is a string.
Then, you set up the count dictionary, with one key for every element of words and each value set to 0.
Last, you count the occurrences of every element of words in text with the count() method and store the result under the respective key of your count dictionary.
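To see why the code in the question prints only zeros, here is a small sketch that runs without a PDF. The sample_text string and the two helper functions are made up for illustration, and the keywords are written without trailing spaces. Looping over a string with for yields one character at a time, so elements never contains a whole word, while str.count() searches the full text.

```python
# Hypothetical sample text standing in for the extracted PDF content.
sample_text = (
    "This blank check company may issue one warrant per unit.\n"
    "The offering closes after the business combination."
)
words = ['blank', 'warrant', 'offering', 'combination', 'SPAC', 'founders']


def count_per_character(text, words):
    """Mimic the loop from the question: it iterates over single characters."""
    count = {word: 0 for word in words}
    for line in text:            # each "line" is really one character
        elements = line.split()  # empty list or a list with one character
        for word in words:
            count[word] += elements.count(word)
    return count


def count_substrings(text, words):
    """Count substring occurrences with str.count(), as in the answer."""
    return {word: text.count(word) for word in words}


print(count_per_character(sample_text, words))  # every value stays 0
print(count_substrings(sample_text, words))
```

Keep in mind that str.count() is case-sensitive and matches substrings, so any trailing space in a keyword is part of the match: 'combination ' would not match "combination." at the end of a sentence.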