1. All the PDFs have the same format.
2. The problem to solve: iterate through all the PDF files in a specified folder and store their text in Excel, with the contents of each file on its own row.
Below is the current code. It can already extract the text of one specified PDF into Excel, but I still need to solve the folder traversal and writing each new file's content to the next Excel row (row number + 1).
Many thanks!
============================================
import pdfplumber  # parses PDF files, particularly ones containing tables
from openpyxl import Workbook  # reads and writes Excel files

def parse(pdf):
    targets = []  # collected results
    for page in pdf.pages:
        words = page.extract_words(x_tolerance=5)
        for word in words:
            targets.append(word['text'])
    return targets
    # print(targets)

# save
def save(targets, out_path, sheet_name='targets'):
    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    ws.append(targets)
    print(ws)
    # ws.append(list(targets.values()))
    wb.save(out_path)

# main entry point
if __name__ == "__main__":
    print(__doc__)
    path = 'c:/l tax01.PDF'
    out_path = 'c:/PDF_Inf-2.xlsx'
    pdf = pdfplumber.open(path)
    targets = parse(pdf)
    save(targets, out_path)
    print('run over!')
CodePudding user response:
Python's os module has a function that can traverse all the files in a directory.
CodePudding user response:
Solved, thank you! @JMZL.
==========================================
import pdfplumber  # parses PDF files, particularly ones containing tables
from openpyxl import Workbook  # reads and writes Excel files
import os

def parse(pdf):
    targets = []  # collected results
    for page in pdf.pages:
        words = page.extract_words(x_tolerance=5)
        for word in words:
            targets.append(word['text'])
    return targets
    # print(targets)

# save
def save(targets, out_path, sheet_name='targets'):
    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    ws.append(targets)
    # print(ws)
    # ws.append(list(targets.values()))
    wb.save(out_path)

# main entry point
if __name__ == "__main__":
    print(__doc__)
    path = 'output'
    excelnumb = 1
    files = os.listdir(path)
    # out_path = 'PDF_Inf-2.xlsx'
    for file in files:
        pdf = pdfplumber.open(path + "/" + file)
        targets = parse(pdf)
        save(targets, '%s.xlsx' % file[:-4])
        excelnumb += 1
    print('run over!')
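
Note that the code above saves one workbook per PDF, while the original question asked for a single Excel file with one row per PDF. A minimal sketch of that variant follows, under some assumptions: the helper names list_pdfs, extract_words, folder_to_rows and save_rows are made up for illustration and are not from the thread, and only files ending in .pdf are picked up. The pdfplumber and openpyxl calls are the same ones used in the code above.

```python
import os


def list_pdfs(folder):
    """Return the paths of the PDF files in `folder`, sorted by name."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )


def extract_words(pdf_path):
    """Return every word of one PDF as a flat list of strings."""
    # Third-party package, imported here so the folder helpers
    # can be used without it.
    import pdfplumber

    words = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for word in page.extract_words(x_tolerance=5):
                words.append(word["text"])
    return words


def folder_to_rows(folder, extract=extract_words):
    """Build one row (a list of words) per PDF found in `folder`."""
    return [extract(path) for path in list_pdfs(folder)]


def save_rows(rows, out_path, sheet_name="targets"):
    """Write all rows into a single worksheet, one row per PDF."""
    from openpyxl import Workbook  # third-party package

    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    for row in rows:
        ws.append(row)  # each append goes to the next empty row
    wb.save(out_path)
```

With the folder name from the code above, calling save_rows(folder_to_rows('output'), 'PDF_Inf-2.xlsx') should produce one workbook in which each PDF occupies its own row. Because ws.append always writes to the next free row, no manual row counter is needed.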