Python: Extract text from multiple PDFs and paste into Excel


I'm totally new to Python; could you help me correct this code?

I would like to add 2 things:

  1. do the operation on multiple PDFs, not just one, pasting the content into A2, A3, A4 and so on
  2. if possible, writing the name of the PDF file in another column (B2, B3, B4).

Thank you in advance. This is the code I'm working with:

import PyPDF2
import openpyxl

pdfFileObj = open("file.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages  # number of pages (value not used here)

pageObj = pdfReader.getPage(0)   # first page only
mytext = pageObj.extractText()   # extract its text

wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext  # write the text into cell A1

wb.save('excel.xlsx')
print('DONE!!')

I've modified the code as suggested and the loop seems to get all the pages! But maybe I have to work with "sheet[f'A{row}'].value = '\n'.join(output)", because it seems to print a lot of spaces.


import PyPDF2
import openpyxl
import os
import glob
root_dir = "your directory"

filenames = []
# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if filename.lower().endswith('.pdf'):
        filenames.append(filename)  # glob already yields the full path
        

wb = openpyxl.load_workbook('excel.xlsx')#your file excel
sheet = wb.active
sheet.title = 'MyPDF'

for row, filename in enumerate(filenames, start=1):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        count=pdfReader.numPages
        pageObj = pdfReader.getPage(0)
        mytext = pageObj.extractText()
        for i in range(count): 
            page = pdfReader.getPage(i)
            output = []
            output = page.extractText() 
            print(output)

    sheet[f'A{row}'].value = '\n'.join(output)
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx') #your file excel
print('DONE!!')
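
For reference, a minimal sketch (assuming the same PyPDF2 < 3.0 API as above) that collects each page's text into a list and only joins it at the end. In the loop above, output is reassigned to a plain string for every page, so '\n'.join(output) inserts a separator between every single character of the last page, which is what produces all the extra whitespace.

import PyPDF2
import openpyxl
import glob

root_dir = "your directory"  # needs a trailing slash, e.g. /root/dir/

# collect every .pdf under root_dir (glob already yields the full path)
filenames = [f for f in glob.iglob(root_dir + '**/**', recursive=True)
             if f.lower().endswith('.pdf')]

wb = openpyxl.load_workbook('excel.xlsx')  # your excel file
sheet = wb.active
sheet.title = 'MyPDF'

# start=2 so the first file lands in A2/B2, as in the original request
for row, filename in enumerate(filenames, start=2):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        pages = []
        for i in range(pdfReader.numPages):
            pages.append(pdfReader.getPage(i).extractText())  # one string per page
    sheet[f'A{row}'].value = '\n'.join(pages)  # join page texts, not characters
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx')
print('DONE!!')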

CodePudding user response:

You basically want to put the code you wrote for reading the PDF file into a for loop that iterates over the filenames (in this case, the filenames are stored in a tuple).

Using enumerate, row starts at 1 and increments on every iteration of the loop, so the text and filename will be put into A1 and B1, then A2 and B2, and so on.
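
For instance, with a small tuple of names, enumerate produces (row, filename) pairs like this (the full code follows below):

for row, filename in enumerate(("file.pdf", "file2.pdf"), start=1):
    print(row, filename)
# prints:
# 1 file.pdf
# 2 file2.pdf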

import PyPDF2
import openpyxl

filenames = ("file.pdf", 
             "file2.pdf", 
             "file3.pdf", 
            )

wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'

for row, filename in enumerate(filenames, start=1):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        pdfReader.numPages
        pageObj = pdfReader.getPage(0)
        mytext = pageObj.extractText()
    
    sheet[f'A{row}'].value = mytext
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx')
print('DONE!!')

You can get a list of all the filenames ending in .pdf quite easily by iterating over all the files in a directory, and checking if the filename ends in .pdf. If it does, use os.path.join to give you the full filepath, and append it to the filenames list.

You could also use the glob module; a short sketch follows the listing below.

import os

filenames = []
directory = r"C:\Stuff\PDF Files"
for filename in os.listdir(directory):
    if filename.lower().endswith(".pdf"):
        filenames.append(os.path.join(directory, filename))
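
Or, as a short sketch with glob (glob.glob with a recursive pattern returns matching paths directly; note the pattern match can be case-sensitive on non-Windows platforms, unlike the .lower() check above):

import glob
import os

directory = r"C:\Stuff\PDF Files"
# '**' together with recursive=True also descends into subfolders
filenames = glob.glob(os.path.join(directory, "**", "*.pdf"), recursive=True)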

Final code:

import PyPDF2
import openpyxl
import os

filenames = []
directory = r'C:\Stuff\PDF Files'
for filename in os.listdir(directory):
    if filename.lower().endswith('.pdf'):
        filenames.append(os.path.join(directory, filename))

wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'

for row, filename in enumerate(filenames, start=1):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        pdfReader.numPages
        pageObj = pdfReader.getPage(0)
        mytext = pageObj.extractText()
    
    sheet[f'A{row}'].value = mytext
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx')
print('DONE!!')