How to extract the contents of a PDF file with Python (text, images, curves)

Code for this problem can be found online, but none of it runs for me. All I want is to extract the text from a PDF file (it is not encrypted). This is urgent, so if anyone knows how, please help; I would be very grateful. My code is below, but it always fails with an error. The first error says that PDFDocument cannot be imported in "from pdfminer.pdfparser import PDFParser, PDFDocument", and the second is the same problem with "from pdfminer.pdfinterp import PDFTextExtractionNotAllowed". Those are the only two errors. I have spent two days on this and still cannot find a solution.

import sys
import importlib
importlib.reload(sys)

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import *
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

'''
Parse a PDF file; the file contains many kinds of objects
'''


# parse a PDF file
def parse(pdf_path):
    fp = open(pdf_path, 'rb')  # open the file in binary read mode
    # create a PDF parser from the file object
    parser = PDFParser(fp)
    # create a PDF document
    doc = PDFDocument()
    # connect the parser and the document object
    parser.set_document(doc)
    doc.set_parser(parser)

    # supply the initial password
    # if there is no password, an empty string is used
    doc.initialize()

    # check whether the document allows text extraction; stop if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # counters for pages, images, curves, figures and horizontal text boxes
        num_page, num_image, num_curve, num_figure, num_TextBoxHorizontal = 0, 0, 0, 0, 0

        # iterate over the page list, handling the content of one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the page list
            num_page += 1  # one more page
            interpreter.process_page(page)
            # receive the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTImage):  # image object
                    num_image += 1
                if isinstance(x, LTCurve):  # curve object
                    num_curve += 1
                if isinstance(x, LTFigure):  # figure object
                    num_figure += 1
                if isinstance(x, LTTextBoxHorizontal):  # text content
                    num_TextBoxHorizontal += 1  # one more horizontal text box
                    # save the text content
                    with open(r'test.txt', 'a') as f:
                        results = x.get_text()
                        f.write(results + '\n')
        print('number of objects:\n', 'pages: %s\n' % num_page, 'images: %s\n' % num_image, 'curves: %s\n' % num_curve, 'horizontal text boxes: %s\n'
              % num_TextBoxHorizontal)


if __name__ == '__main__':
    pdf_path = r'C:\Users\fanyu\PDF\Desktop\test.pdf'
    parse(pdf_path)
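
A possible explanation for the two import errors, offered as an assumption because the post does not say which package is installed: the listing follows the pdfminer3k layout, while pdfminer.six keeps PDFDocument and PDFTextExtractionNotAllowed in pdfminer.pdfdocument and gets pages from PDFPage in pdfminer.pdfpage. Below is a minimal sketch of the same counting logic under that assumption. The names count_layout_objects and out_path are made up for illustration and do not come from the original code. The pdfminer imports sit inside parse so that the counting helper can be tried on its own.

```python
from collections import Counter


def count_layout_objects(layout, kinds):
    """Count the items in `layout` that are instances of each class in `kinds`.

    `kinds` maps a label to a class; one item may be counted under
    several labels if it is an instance of several of the classes.
    """
    counts = Counter({label: 0 for label in kinds})
    for item in layout:
        for label, cls in kinds.items():
            if isinstance(item, cls):
                counts[label] += 1
    return counts


def parse(pdf_path, out_path="test.txt"):
    # Assumption: the installed package is pdfminer.six, where these names
    # live in pdfdocument / pdfpage instead of pdfparser / pdfinterp.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import (LAParams, LTImage, LTCurve, LTFigure,
                                 LTTextBoxHorizontal)

    kinds = {"images": LTImage, "curves": LTCurve, "figures": LTFigure,
             "text boxes": LTTextBoxHorizontal}
    totals = Counter(pages=0)
    with open(pdf_path, "rb") as fp, \
            open(out_path, "a", encoding="utf-8") as out:
        # The parser is passed to the constructor; the password defaults to ''.
        doc = PDFDocument(PDFParser(fp))
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed(pdf_path)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            totals["pages"] += 1
            interpreter.process_page(page)
            layout = device.get_result()  # LTPage for the current page
            totals.update(count_layout_objects(layout, kinds))
            # Append the text of every top-level horizontal text box.
            for item in layout:
                if isinstance(item, LTTextBoxHorizontal):
                    out.write(item.get_text() + "\n")
    return totals
```

Calling parse with the path to a PDF would return the totals as a Counter and append the extracted text to out_path. If the installed package turns out to be pdfminer3k after all, the original import lines are the ones to keep and this sketch does not apply.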