How to extract the contents of a PDF file with Python (text, images, curves)

Time:11-19

I found code for this problem online, but none of it runs for me. I can't get the text out of a PDF file (editable, not encrypted). It's urgent, so if any expert knows the answer, please help! I'd be very grateful. The code below always fails with two errors: one on `from pdfminer.pdfparser import PDFParser, PDFDocument` (the PDFDocument import), and one on `from pdfminer.pdfinterp import PDFTextExtractionNotAllowed`. I've spent two days on this and can't find a solution. I'm out of ideas.

import sys
import importlib
importlib.reload(sys)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import *
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

'''
Parse a PDF file; the file contains several kinds of objects
'''


# Parse a PDF file
def parse(pdf_path):
    fp = open(pdf_path, 'rb')  # open the file in binary read mode
    # Create a PDF parser from the file object
    parser = PDFParser(fp)
    # Create a PDF document object
    doc = PDFDocument()
    # Connect the parser and the document object
    parser.set_document(doc)
    doc.set_parser(parser)

    # Supply the initial password;
    # with no password, an empty string is used
    doc.initialize()

    # Check whether the document allows text extraction; stop if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

        # Loop over the pages, processing one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the list of pages
            num_page += 1  # one more page
            interpreter.process_page(page)
            # Get the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):  # image object
                    num_image += 1
                if isinstance(x, LTCurve):  # curve object
                    num_curve += 1
                if isinstance(x, LTFigure):  # figure object
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):  # text content
                    num_TextBoxHorizontal += 1  # one more horizontal text box
                    # Save the text content
                    with open(r'test.txt', 'a') as f:
                        results = x.get_text()
                        f.write(results + '\n')
        print('Number of objects:\n', 'pages: %s\n' % num_page, 'images: %s\n' % num_image, 'curves: %s\n' % num_curve, 'horizontal text boxes: %s\n'
              % num_TextBoxHorizontal)


if __name__ == '__main__':
    pdf_path = r'C:\Users\fanyu\Desktop\test.pdf'
    parse(pdf_path)
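
A possible explanation, assuming the installed package is pdfminer.six rather than pdfminer3k: the two failing imports match the older pdfminer3k module layout. In pdfminer.six, PDFDocument lives in pdfminer.pdfdocument and takes the parser in its constructor, PDFTextExtractionNotAllowed can be imported from pdfminer.pdfdocument, and pages come from PDFPage.create_pages(doc) instead of doc.get_pages(). The sketch below redoes the same counting under that assumption. The helper names count_objects and iter_layouts and the out_path parameter are made up for illustration and are not part of pdfminer.

```python
from collections import Counter


def count_objects(layouts, kinds, text_label="text_boxes"):
    """Count pages and layout objects by kind (hypothetical helper).

    `layouts` is an iterable of pages, each an iterable of layout objects.
    `kinds` maps a label to a class. Text is collected from objects that
    match the class registered under `text_label`.
    """
    counts = Counter()
    texts = []
    for layout in layouts:
        counts["pages"] += 1
        for obj in layout:
            for label, cls in kinds.items():
                if isinstance(obj, cls):
                    counts[label] += 1
                    if label == text_label:
                        texts.append(obj.get_text())
    return counts, texts


def iter_layouts(pdf_path):
    """Yield one LTPage layout per page (assumes pdfminer.six is installed)."""
    # Imported here so count_objects() can be used without pdfminer.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, "rb") as fp:
        # The constructor replaces set_document / set_parser / initialize.
        doc = PDFDocument(PDFParser(fp))
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            yield device.get_result()


def parse(pdf_path, out_path="test.txt"):
    """Count top-level objects per kind and append text boxes to out_path."""
    from pdfminer.layout import LTCurve, LTFigure, LTImage, LTTextBoxHorizontal

    kinds = {
        "images": LTImage,
        "curves": LTCurve,
        "figures": LTFigure,
        "text_boxes": LTTextBoxHorizontal,
    }
    counts, texts = count_objects(iter_layouts(pdf_path), kinds)
    with open(out_path, "a", encoding="utf-8") as f:
        for text in texts:
            f.write(text + "\n")
    return counts
```

Calling parse(pdf_path) with the path from the question would return the counters and append the text to test.txt. Like the original loop, this only looks at the top-level objects of each page, so images nested inside an LTFigure are not counted separately.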