pdfplumber | Extract text from dynamic column layouts

Attempted Solution at bottom of post.

I have near-working code that extracts the sentence containing a phrase, across multiple lines.

However, some pages have columns, so the output for those pages is incorrect: text from separate columns is merged into a single, nonsensical sentence.

This problem has been addressed in the following posts:

Question:

How do I "if-condition" whether there are columns?

Pages may have no columns.
Pages may have more than 2 columns.
Pages may also have headers and footers (which can be left out).

Example .pdf with dynamic text layout: PDF (pg. 2).

Jupyter Notebook:

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

def scrape_sentence(phrase, lines, index):
    # -- Gather sentence 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))
    
    # Previous lines
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i <= 0:
            break
            
        sentence = lines[pre_i] + sentence
        
        if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or '  •  ' in lines[pre_i]:
            flag = 1  # sentence boundary found; stop scanning backwards
    
    print("\n", sentence)
    
    # Following lines
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break
            
        sentence = sentence + lines[post_i]
        
        if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or '  •  ' in lines[post_i]:
            flag = 1  # sentence boundary found; stop scanning forwards
    
    print("\n", sentence)
    
    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    
    return sentence

# ---

phrase = 'Gulf Petrochemical Industries Company'

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1

Example Incorrect Output:

-- sentence -- being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 
len(lines) 47

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of 

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report,  Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf  Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is  GPIC is a Responsible Care Company certified for RC 14001 since July 2010. 
We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain  and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum  always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of  In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption

...

Attempted Minimal Solution: This splits every page into 2 columns, regardless of whether the page actually has 2.

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        # float() keeps the arithmetic valid whether pdfplumber reports
        # page dimensions as Decimal or as float
        width, height = float(page.width), float(page.height)
        left = page.crop((0, 0, 0.5 * width, 0.9 * height))
        right = page.crop((0.5 * width, 0, width, height))
        
        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)

Please let me know if there is anything else I should clarify.

User response:

This answer lets you scrape the text in the intended reading order.

Towards Data Science article PDF Text Extraction in Python:

Compared with PyPDF2, PDFMiner’s scope is much more limited, it really focuses only on extracting the text from the source information of a pdf file.

from io import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return output_string.getvalue()

file_path = ''  # ! set this to the path of your PDF
text = convert_pdf_to_string(file_path)
print(text)
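
If you would rather stay with pdfplumber and branch on whether a page has columns, as the question asks, one possible approach is to look for vertical bands that no word overlaps. This is only a sketch and does not come from the article quoted above. The helper names `body_words`, `find_gutters` and `column_ranges`, and the thresholds `header`, `footer` and `min_gap`, are made up for illustration; the only thing assumed about pdfplumber is that `page.extract_words()` returns dicts with `x0`, `x1`, `top` and `bottom` keys.

```python
def body_words(words, page_height, header=0.08, footer=0.08):
    """Drop words in the top/bottom bands (assumed header/footer areas)."""
    top_limit = header * page_height
    bottom_limit = (1 - footer) * page_height
    return [w for w in words
            if float(w["top"]) >= top_limit and float(w["bottom"]) <= bottom_limit]


def find_gutters(words, min_gap=15.0):
    """Return (start, end) x-ranges of at least min_gap that no word overlaps."""
    # Merge the horizontal extents of all words into covered intervals.
    spans = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    merged = []
    for x0, x1 in spans:
        if merged and x0 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x1)
        else:
            merged.append([x0, x1])

    # A gutter is a sufficiently wide gap between two covered intervals.
    gutters = []
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        if right_start - left_end >= min_gap:
            gutters.append((left_end, right_start))
    return gutters


def column_ranges(words, page_width, min_gap=15.0):
    """Return one (x0, x1) range per detected column; one range = no columns."""
    gutters = find_gutters(words, min_gap)
    # Cut the page in the middle of each gutter.
    edges = [0.0] + [(a + b) / 2 for a, b in gutters] + [float(page_width)]
    return list(zip(edges, edges[1:]))


# Synthetic word boxes standing in for page.extract_words() output.
words = [
    {"x0": 50, "x1": 550, "top": 20, "bottom": 30},     # full-width header
    {"x0": 50, "x1": 280, "top": 100, "bottom": 110},   # left column
    {"x0": 60, "x1": 250, "top": 115, "bottom": 125},   # left column
    {"x0": 320, "x1": 550, "top": 100, "bottom": 110},  # right column
]
columns = column_ranges(body_words(words, page_height=800), page_width=600)
if len(columns) > 1:
    print("columns detected:", columns)
else:
    print("single column")
```

With a real page, each `(x0, x1)` range could be passed to `page.crop((x0, 0, x1, page.height)).extract_text()` and the results joined in order. A full-width heading or figure inside the body would hide the gutter, so treat this as a heuristic rather than a reliable test.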

Cleansing can be applied thereafter.
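
As a rough illustration of what such cleansing might look like, the sketch below collapses line breaks and then picks out the sentences containing a phrase. The function names `clean_text` and `sentences_containing` are hypothetical, the sample string is invented, and the sentence splitter is deliberately naive: abbreviations such as "e.g." will be split incorrectly.

```python
import re


def clean_text(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def sentences_containing(phrase, text):
    """Return every sentence of the cleaned text that contains phrase."""
    cleaned = clean_text(text)
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    return [s for s in sentences if phrase in s]


# Invented sample standing in for the string returned by the extractor.
sample = "The plant opened early.\nThe Example Chemical\nCompany exports urea! It also\ntrains staff."
print(sentences_containing("Example Chemical Company", sample))
```

Cleaning before searching matters here because the phrase itself can be broken across two extracted lines, as in the sample.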