Python 3: reading a PDF with PDFMiner, how to save LTImage objects (that is, how to save the pictures)

Time:09-27


I am using PDFMiner in Python to parse a PDF.
One of the layout object types is LTFigure.
I now know that image objects of type LTImage can be extracted from an LTFigure.
My question: how do I save an LTImage object, that is, how do I save the picture?

CodePudding user response:


See whether this helps: https://www.jianshu.com/p/938763947de3

CodePudding user response:


import os
import sys
from binascii import b2a_hex

from pdfminer.layout import LTFigure, LTImage, LTTextBox, LTTextLine


def parse_lt_objs(lt_objs, page_number, images_folder, text=None):
    """Iterate through the list of LT* objects and capture the text or image data contained in each."""
    text_content = []
    for lt_obj in lt_objs:
        if isinstance(lt_obj, (LTTextBox, LTTextLine)):
            # text
            text_content.append(lt_obj.get_text())
        elif isinstance(lt_obj, LTImage):
            # text_content.append('<img src="https://bbs.csdn.net/topics/tt"/>')
            # an image, so save it to the designated folder, and note its place in the text
            saved_file = save_image(lt_obj, page_number, images_folder)
            if saved_file:
                # use an HTML-style <img/> tag to mark the position of the image within the text
                text_content.append('<img src="' + os.path.join(images_folder, saved_file) + '"/>')
            else:
                print("Error saving image on page", page_number, repr(lt_obj), file=sys.stderr)
        elif isinstance(lt_obj, LTFigure):
            # LTFigure objects are containers for other LT* objects, so recurse through the children
            text_content.append('<figure src="https://bbs.csdn.net/topics/tt"/>')
            # This call raised an error saying lt_obj has no objs attribute when it passed lt_obj.objs.
            # An LTFigure is itself iterable over its children, so it is passed directly here.
            text_content.append(parse_lt_objs(lt_obj, page_number, images_folder, text_content))
    return '\n'.join(text_content)


def save_image(lt_image, page_number, images_folder):
    """Try to save the image data from this LTImage object, and return the file name, if successful."""
    result = None
    if lt_image.stream:
        file_stream = lt_image.stream.get_rawdata()
        if file_stream:
            # the first four bytes carry the magic number used to pick the extension
            file_ext = determine_image_type(file_stream[0:4])
            if file_ext:
                file_name = ''.join([str(page_number), '_', lt_image.name, file_ext])
                if write_file(images_folder, file_name, file_stream, flags='wb'):
                    result = file_name
    return result


def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes."""
    file_type = None
    # under Python 3, b2a_hex returns bytes, so compare against bytes literals
    bytes_as_hex = b2a_hex(stream_first_4_bytes)
    if bytes_as_hex.startswith(b'ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == b'89504e47':
        file_type = '.png'
    elif bytes_as_hex == b'47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith(b'424d'):
        file_type = '.bmp'
    return file_type


def write_file(folder, filename, filedata, flags='w'):
    """Write the file data to the folder and filename combination
    (flags: 'w' for write text, 'wb' for write binary, use 'a' instead of 'w' for append)."""
    result = False
    if os.path.isdir(folder):
        try:
            with open(os.path.join(folder, filename), flags) as file_obj:
                file_obj.write(filedata)
            result = True
        except IOError:
            pass
    return result

According to the documentation, it should be done roughly like this.
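
One thing to watch under Python 3: the magic-number check has to compare bytes with bytes. As a minimal sketch of the same idea, the check can also be done directly on the raw bytes without going through hex. The helper name guess_image_extension and the table MAGIC_PREFIXES are made up for this illustration and are not part of PDFMiner.

```python
# Hypothetical helper (not a PDFMiner API): map leading magic bytes to an extension.
MAGIC_PREFIXES = (
    (b"\xff\xd8", ".jpeg"),  # hex ffd8
    (b"\x89PNG", ".png"),    # hex 89504e47
    (b"GIF8", ".gif"),       # hex 47494638
    (b"BM", ".bmp"),         # hex 424d
)


def guess_image_extension(data):
    """Return a file extension for raw image bytes, or None if unrecognized."""
    for prefix, ext in MAGIC_PREFIXES:
        if data.startswith(prefix):
            return ext
    return None


print(guess_image_extension(b"\x89PNG\r\n\x1a\n"))  # .png
```

Note that the bytes returned by get_rawdata() are the stream as stored in the PDF. For many non-JPEG images that stream is still compressed, so none of the prefixes match, the function returns None, and nothing gets saved.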

CodePudding user response:


I have given up on this. Thank you very much anyway.

CodePudding user response:

Could you tell me how to extract the LTImage objects from an LTFigure?
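
A possible approach, sketched under the assumption that a recent pdfminer.six is installed (the original PDFMiner package may differ): LTPage and LTFigure are iterable containers, so a recursive walk can collect every LTImage, and pdfminer.six ships an ImageWriter class whose export_image method writes an LTImage to disk. The names iter_layout and export_images are invented for this example.

```python
def iter_layout(obj, wanted_type):
    """Yield every object of wanted_type nested anywhere inside obj."""
    if isinstance(obj, wanted_type):
        yield obj
    if isinstance(obj, (str, bytes)):
        return  # never descend into plain strings
    # Containers such as LTPage and LTFigure are iterable; leaves are not.
    try:
        children = iter(obj)
    except TypeError:
        return
    for child in children:
        yield from iter_layout(child, wanted_type)


def export_images(pdf_path, out_dir):
    """Save every image found in pdf_path under out_dir; return the file names."""
    # Imported lazily so iter_layout stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.image import ImageWriter
    from pdfminer.layout import LTImage

    writer = ImageWriter(out_dir)
    names = []
    for page in extract_pages(pdf_path):
        for image in iter_layout(page, LTImage):
            names.append(writer.export_image(image))
    return names
```

Calling export_images("input.pdf", "images") would then write one file per image it finds; both arguments are placeholder paths. Letting ImageWriter pick the output format avoids guessing the file type from the first bytes of the raw stream.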