Extract Text from PDF using Python

Hi, I am a Python beginner.

I am trying to extract text from only a few boxes in a PDF file.

I used the pytesseract library to extract the text, but it returns all the text on the page. I want to limit the extraction to certain fields in the file, such as the FEI number and Date of Inspection at the top and the employee signature at the bottom. Can someone please tell me which packages I can use for this, and how?

The code I am using is something I borrowed from the internet:

# Notebook-only shell command: install poppler, which pdf2image requires
!apt-get install -y poppler-utils

from pdf2image import convert_from_path
from pytesseract import image_to_string


def convert_pdf_to_img(pdf_file):
    """
    @desc: this function converts a PDF into Image
    
    @params:
        - pdf_file: the file to be converted
    
    @returns:
        - an iterable containing one image for each page of the PDF
    """
    return convert_from_path(pdf_file)


def convert_image_to_text(file):
    """
    @desc: this function extracts text from image
    
    @params:
        - file: the image file to extract the content
    
    @returns:
        - the textual content of single image
    """
    
    text = image_to_string(file)
    return text


def get_text_from_any_pdf(pdf_file):
    """
    @desc: this function is our final system combining the previous functions
    
    @params:
        - pdf_file: the original PDF file
    
    @returns:
        - the textual content of ALL the pages
    """
    images = convert_pdf_to_img(pdf_file)
    final_text = ""
    for pg, img in enumerate(images):
        # Append each page's text; plain "=" would keep only the last page
        final_text += convert_image_to_text(img)
        #print("Page n°{}".format(pg))
        #print(convert_image_to_text(img))
    
    return final_text

Kaggle link for my notebook

CodePudding user response：

I'm sure it would be more efficient to crop each page image down to the parts you want and run the text extraction only on those crops. For the cropping I'd use cv2, the OpenCV image-processing module for Python.
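As a rough sketch of the cropping idea: the code below uses the PIL images that pdf2image already returns and their crop() method instead of cv2, so no extra conversion is needed. The function names, the REGIONS dictionary and every box coordinate in it are made up for illustration; you would have to measure the real positions on your own form.

```python
def rel_box_to_pixels(width, height, rel_box):
    """Convert a (left, top, right, bottom) box given as fractions of the
    page size into integer pixel coordinates."""
    left, top, right, bottom = rel_box
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def ocr_regions(pdf_file, regions, page_index=0):
    """Run OCR only on the named regions of one page.

    regions maps a field name to a fractional (left, top, right, bottom) box.
    """
    # Imported here so the helper above works without the OCR stack installed
    from pdf2image import convert_from_path
    from pytesseract import image_to_string

    page = convert_from_path(pdf_file)[page_index]  # a PIL image
    width, height = page.size
    results = {}
    for name, rel_box in regions.items():
        crop = page.crop(rel_box_to_pixels(width, height, rel_box))
        results[name] = image_to_string(crop).strip()
    return results


# Hypothetical boxes, as fractions of the page: replace with measured values
REGIONS = {
    "date_of_inspection": (0.70, 0.06, 0.98, 0.10),
    "fei_number": (0.70, 0.10, 0.98, 0.14),
    "signature": (0.30, 0.85, 0.98, 0.95),
}
```

Calling ocr_regions('fda.pdf', REGIONS) would then return a dictionary with one string per field. Fractional boxes are used so the same numbers keep working if the page is rendered at a different DPI. With cv2 the image is a NumPy array instead, and the equivalent crop is a slice of the form img[top:bottom, left:right].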

CodePudding user response：

pdfplumber's .extract_table() can help isolate the "boxes".

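import pdfplumber

# Labels are spelled the way they come out of the PDF's text layer, misreadings included
labels = ('OF INSPECTION', 'FEJNUMBER', 'SIGNATURE')

with pdfplumber.open('fda.pdf') as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

for row in table:
    for col in row:
        if col and any(label in col for label in labels):
            print(repr(col))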

Output:

'OATE(S)OF INSPECTION \n10/15/2018-10/25/2018*'
'FEJNUMBER \n123456789'
"EMPI.OYEE(S) SIGNATURE \nFirstName L  LastName,  Investigator \nFJenr«etL~-•L -\nX  o•  SIJ'led 1~2S-20ta06ot oc"

There are some issues with the accuracy of the text extraction (note 'OATE' and 'FEJNUMBER'), which is why the search strings have to match the misread labels. From here, though, each value you want starts on the 2nd line of its cell.
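A small sketch of pulling the values out of those cells, assuming every cell has the shape "label, newline, value" seen in the output above. The split_cell helper is a made-up name for illustration, not part of pdfplumber.

```python
def split_cell(cell):
    """Split a table cell into (label, value): the first line is the label,
    everything after the first newline is the value."""
    label, _, value = cell.partition("\n")
    return label.strip(), value.strip()


# Cells copied from the output above, including the misread labels
cells = [
    "OATE(S)OF INSPECTION \n10/15/2018-10/25/2018*",
    "FEJNUMBER \n123456789",
]

fields = dict(split_cell(cell) for cell in cells)
print(fields["FEJNUMBER"])             # 123456789
print(fields["OATE(S)OF INSPECTION"])  # 10/15/2018-10/25/2018*
```

The dictionary keys are the labels exactly as extracted, so a lookup has to use the misread spelling. The signature cell spans several lines, so its value comes back as a multi-line string.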