Beginner question: traversing files and storing the results in Excel

Time:10-09

1. Purpose: use Python to extract the text from all the PDF files in a folder and write it into an Excel file.
All the PDFs have the same format.

2. Problem to solve: iterate over all the PDF files in a specified folder and store their text in Excel, with the contents of each file written to a new row.

Below is my current code. It can already extract the text of one specified PDF into Excel, but I still need to solve two things: traversing the files, and writing each new file's content to the next Excel row (row number + 1).


Thanks in advance!

============================================

import pdfplumber  # parses PDF files, especially ones that contain tables
from openpyxl import Workbook  # reads and writes Excel files


def parse(pdf):
    targets = []  # collected words
    for page in pdf.pages:
        words = page.extract_words(x_tolerance=5)
        for word in words:
            targets.append(word['text'])
    return targets
    # print(targets)


# save
def save(targets, out_path, sheet_name='targets'):
    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    ws.append(targets)
    print(ws)
    # ws.append(list(targets.values()))
    wb.save(out_path)


# main entry point
if __name__ == "__main__":
    print(__doc__)
    path = 'c:/l tax01.pdf'
    out_path = 'c:/PDF_Inf-2.xlsx'
    pdf = pdfplumber.open(path)
    targets = parse(pdf)
    save(targets, out_path)
    print('run over!')

CodePudding user response:

Python's os module has functions that can traverse all the files in a directory.
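
As a rough illustration of that suggestion, here is a minimal sketch that uses `os.listdir` to collect the PDF files in one folder. The helper name `list_pdf_files` is hypothetical, and the sketch assumes the PDFs sit directly in the folder (no subfolders) and are identified by a `.pdf` extension.

```python
import os


def list_pdf_files(folder):
    """Return the full paths of the PDF files directly inside `folder`."""
    paths = []
    # sorted() gives a stable order; os.listdir itself returns names in arbitrary order
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip subfolders and anything that does not end in .pdf (case-insensitive)
        if os.path.isfile(full_path) and name.lower().endswith(".pdf"):
            paths.append(full_path)
    return paths
```

If the PDFs were spread over nested subfolders, `os.walk` would be the usual alternative to `os.listdir`.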

CodePudding user response:


Solved, thank you! @JMZL

==========================================

import pdfplumber  # parses PDF files, especially ones that contain tables
from openpyxl import Workbook  # reads and writes Excel files
import os


def parse(pdf):
    targets = []  # collected words
    for page in pdf.pages:
        words = page.extract_words(x_tolerance=5)
        for word in words:
            targets.append(word['text'])
    return targets
    # print(targets)


# save
def save(targets, out_path, sheet_name='targets'):
    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    ws.append(targets)
    # print(ws)
    # ws.append(list(targets.values()))
    wb.save(out_path)


# main entry point
if __name__ == "__main__":
    print(__doc__)
    path = 'output'
    excelnumb = 1
    files = os.listdir(path)
    # out_path = 'PDF_Inf-2.xlsx'

    for file in files:
        pdf = pdfplumber.open(path + "/" + file)
        targets = parse(pdf)
        save(targets, '%s.xlsx' % file[:-4])
        excelnumb += 1
    print('run over!')
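
The code above saves a separate workbook for each PDF. If the goal from the original question is wanted instead (one workbook, one row per file), a sketch along the following lines might work. The function names `pdf_to_row`, `build_rows` and `save_rows` are hypothetical, putting the file name in the first column is an assumption added for readability, and the third-party imports are placed inside the functions only so that the folder-walking part can be tried without those libraries installed.

```python
import os


def pdf_to_row(pdf_path):
    """Return [file name, word1, word2, ...] for one PDF."""
    import pdfplumber  # third-party; only needed when a real PDF is parsed

    row = [os.path.basename(pdf_path)]
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for word in page.extract_words(x_tolerance=5):
                row.append(word["text"])
    return row


def build_rows(folder, to_row=pdf_to_row):
    """Build one row per PDF in `folder`, in sorted file-name order."""
    rows = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf"):
            rows.append(to_row(os.path.join(folder, name)))
    return rows


def save_rows(rows, out_path, sheet_name="targets"):
    """Write all rows to a single worksheet."""
    from openpyxl import Workbook  # third-party; only needed when saving

    wb = Workbook()
    ws = wb.active
    ws.title = sheet_name
    for row in rows:
        ws.append(row)  # each append() call writes to the next empty row
    wb.save(out_path)
```

With the folder used in the solution, calling `save_rows(build_rows('output'), 'PDF_Inf-2.xlsx')` would then write everything into one file. Because `ws.append` always writes to the next free row, no manual row counter such as `excelnumb` is needed.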