i'm a total new in python, could you help me correct this code?
I would like to add 2 things:
- do the operation on multiple pdf and not just one and pasting the content in A2,A3 A4 and so on
- if possible writing in the another row (B2,B3,B4) the name of the pdf file.
Thank you in advance, this is the code i'm working with
import PyPDF2
import openpyxl
pdfFileObj = open("file.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext
wb.save('excel.xlsx')
print('DONE!!')
I've modified the code as suggested and the cycle seems to get all the pages! but maybe i have to work with "sheet[f'A{row}'].value = '\n'.join(output)" because it seems to print a lot of spaces
import PyPDF2
import openpyxl
import os
import glob
root_dir = "your directory"
filenames = []
# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir '**/**', recursive=True):
if filename.lower().endswith('.pdf'):
filenames.append(os.path.join(directory, filename))
wb = openpyxl.load_workbook('excel.xlsx')#your file excel
sheet = wb.active
sheet.title = 'MyPDF'
for row, filename in enumerate(filenames, start=1):
with open(filename, 'rb') as f:
pdfReader = PyPDF2.PdfFileReader(f)
count=pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
for i in range(count):
page = pdfReader.getPage(i)
output = []
output = page.extractText()
print(output)
sheet[f'A{row}'].value = '\n'.join(output)
sheet[f'B{row}'].value = filename
wb.save('excel.xlsx') #your file excel
print('DONE!!')
CodePudding user response:
You basically want to put the code you wrote which reads the pdf file into a for
loop which iterates over the filenames (in this case, the filenames are stored as a tuple
).
Using enumerate
, row
increments every iteration of the loop, and starts at 1. So the text and filename will be put into A1 and B1, then A2 and B2, and so on.
import PyPDF2
import openpyxl
filenames = ("file.pdf",
"file2.pdf",
"file3.pdf",
)
wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'
for row, filename in enumerate(filenames, start=1):
with open(filename, 'rb') as f:
pdfReader = PyPDF2.PdfFileReader(f)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
sheet[f'A{row}'].value = mytext
sheet[f'B{row}'].value = filename
wb.save('excel.xlsx')
print('DONE!!')
You can get a list of all the filenames ending in .pdf
quite easily by iterating over all the files in a directory, and checking if the filename ends in .pdf
. If it does, use os.path.join
to give you the full filepath, and append it to the filenames
list.
You could also use the glob
module, too.
import os
filenames = []
directory = r"C:\Stuff\PDF Files"
for filename in os.listdir(directory):
if filename.lower().endswith(".pdf"):
filenames.append(os.path.join(directory, filename))
Final code:
import PyPDF2
import openpyxl
import os
filenames = []
directory = r'C:\Stuff\PDF Files'
for filename in os.listdir(directory):
if filename.lower().endswith('.pdf'):
filenames.append(os.path.join(directory, filename))
wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'
for row, filename in enumerate(filenames, start=1):
with open(filename, 'rb') as f:
pdfReader = PyPDF2.PdfFileReader(f)
pdfReader.numPages
pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
sheet[f'A{row}'].value = mytext
sheet[f'B{row}'].value = filename
wb.save('excel.xlsx')
print('DONE!!')