How to select string rows extracted from a PDF, starting from a specific row that meets a condition

I'm using Python. I have many PDFs and have used pdfminer to extract and arrange the elements I'm interested in (LTChar objects). What I need now is to isolate the text from a starting row until the end.

Consider the PDF linked below. On the last page there is a row that says "Tabla 1. Indicadores...". This is the starting row I'm interested in. I need to extract every row from the "Tabla 1" row until the end.

The example pdf can be found here:

https://drive.google.com/file/d/1h1eiCpP7ipefv1LPwsmv-8PfNgT_Myfw/view?usp=sharing

My code so far is the following:

# Import libraries

import pdfminer
from io import StringIO
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator, TextConverter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.layout import LTPage, LTTextBoxHorizontal, LTTextBoxVertical, LTTextLineHorizontal, LTTextLineVertical, LTTextContainer, LTChar, LTText, LTTextBox, LTAnno
from pdfquery import PDFQuery
import matplotlib.pyplot as plt
from matplotlib import patches
%matplotlib inline
import pandas as pd
import numpy as np
import itertools

# Iterate until I get LTChar objects

fp = open('example.pdf', 'rb')
manager = PDFResourceManager()
laparams = LAParams()
dev = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, dev)
pages = PDFPage.get_pages(fp)

for page in pages:
    interpreter.process_page(page)
    layout = dev.get_result()
    # Note: characters is reset on every page, so after the loop it holds only the last page
    characters = []
    for textbox in layout:
        if isinstance(textbox, LTTextBox):
            for line in textbox:
                if isinstance(line, LTTextLineHorizontal):
                    for char in line:
                        # Keep LTChar objects only and drop space characters
                        if isinstance(char, LTChar):
                            if char.get_text() != ' ':
                                characters.append(char)

# Arrange characters

def arrange_text(x):
    rows = sorted(list(set(c.bbox[1] for c in x)), reverse=True)
    sorted_rows = []
    for row in rows:
        sorted_row = sorted([c for c in x if c.bbox[1] == row], key=lambda c: c.bbox[0])
        sorted_rows.append(sorted_row)
    return sorted_rows

sorted_rows = arrange_text(characters)

# Get text lines

rows = sorted_rows
for row in rows:
    text_line = "".join([c.get_text() for c in row])
    print(text_line)

The current output is:

considerarestosnuevosenfoquesyexperienciasparaunaintervenciónsostenibleque
generemayorbienestar.
•
Sinembargo,Collinsetal.,(1999)adviertenqueelconocerelcontextolocaladetallees
necesarioparaunacorrectaimplementación.Enesesentido,losmodeloscomolaTriple
HéliceyexperienciascomoelProgramaBioculturadebenconsiderarse,perono
importarsedeformaacrítica.Enesesentido,esnecesarioquelaComisiónconvocada
paralaENDERevalúecondetalleelcontextolocal,yevite,enloposible,intervenciones
decarácteruniversalista.
Tabla1.Indicadoressocioeconómicosenlosámbitosurbanoyrural
IndicadorUrbanoRuralAño
Porcentajedeasistenciaescolara85.480.72021
educaciónsecundaria*
Porcentajedepoblaciónconaccesoa87222020-2021
unaredpúblicadealcantarillado*
Porcentajedepoblaciónconaccesoa47.925.82018
algúnserviciofinanciero**
Ingresopromediomensualensoles**1557.4711.42018
Porcentajedepoblaciónpobrepor26.645.72020
áreaderesidencia***
Fuente:INEI(2019,2021a,2021b)
*ObtenidodeINEI(2021a)
**ObtenidodeINEI(2019)
***ObtenidodeINEI(2021b)

The expected output is:

Tabla1.Indicadoressocioeconómicosenlosámbitosurbanoyrural
IndicadorUrbanoRuralAño
Porcentajedeasistenciaescolara85.480.72021
educaciónsecundaria*
Porcentajedepoblaciónconaccesoa87222020-2021
unaredpúblicadealcantarillado*
Porcentajedepoblaciónconaccesoa47.925.82018
algúnserviciofinanciero**
Ingresopromediomensualensoles**1557.4711.42018
Porcentajedepoblaciónpobrepor26.645.72020
áreaderesidencia***
Fuente:INEI(2019,2021a,2021b)
*ObtenidodeINEI(2021a)
**ObtenidodeINEI(2019)
***ObtenidodeINEI(2021b)

Note that I have many PDFs and they all have the same structure. The starting row always begins with the words "Tabla 1." The rest of that row ("Indicadores socioeconómicos...") varies across the PDFs, which is why I think I need to use startswith(). However, if I use only this function, I get just the "Tabla 1." row and nothing else, while I'm interested in that row and all the rows that come after it. In all cases, the condition is that the starting row starts with the characters "Tabla 1.". Any suggestions?

CodePudding user response：

You should first join every text_line into one string, then search for the starting marker, slice the string from that position with text[start:], and finally print it.

# --- before loop ---

rows = sorted_rows

lines = []

# --- loop ---

for row in rows:          
    text_line = "".join([c.get_text() for c in row])
    lines.append(text_line)
  
# --- after loop ---

text = "\n".join(lines)

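# The extraction code drops spaces, so the row reads 'Tabla1.' (not 'Tabla 1.') in the joined text
start = text.find('Tabla1.')

# find() returns -1 when the marker is missing, so check before slicing
if start != -1:
    print(text[start:])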

If you want to make sure that the marker is at the beginning of a line, search for it with a leading \n (for example \nTabla1., without a space because the spaces were removed during extraction) and then skip the newline using text[start+1:]
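Another way to enforce the "beginning of a line" condition is to work on the list of lines and use startswith(), as the question suggests. The following is only a minimal sketch: the function name rows_from_marker and the sample rows are made up for illustration, and the marker is assumed to be 'Tabla1.' because the spaces were dropped during extraction.

```python
def rows_from_marker(lines, marker="Tabla1."):
    """Return the lines from the first one starting with `marker` to the end."""
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return lines[index:]
    return []  # marker not found


# Hypothetical sample rows, shaped like the output shown in the question
sample_lines = [
    "decarácteruniversalista.",
    "Tabla1.Indicadoressocioeconómicosenlosámbitosurbanoyrural",
    "IndicadorUrbanoRuralAño",
    "Fuente:INEI(2019,2021a,2021b)",
]

print("\n".join(rows_from_marker(sample_lines)))
```

With the code from the question, you would pass the lines list built in the loop instead of sample_lines.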

If the text may contain many parts that start with the marker, you can use text.split('\nTabla1.') (again without a space, because the spaces were removed during extraction) to get a list of parts. You will then have to add Tabla1. back at the beginning of every part.

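parts = text.split('\nTabla1.')  # no space in the marker: spaces were dropped during extraction

parts = parts[1:]  # skip first part (before first `Tabla1.`)

parts = ['Tabla1.' + text for text in parts]  # put the marker back at the start of each part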

for text in parts:
    print(text)
    print('---')

Or you could use a regex to capture the text between two consecutive \nTabla1. markers.
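A possible regex version is sketched below. It is an illustration only: the function name split_tables and the sample text are hypothetical, and the marker is assumed to be 'Tabla1.' with no space. Each match runs from a marker at the start of a line up to the next such marker or the end of the text.

```python
import re


def split_tables(text, marker="Tabla1."):
    """Return every block that starts with `marker` at the beginning of a line."""
    escaped = re.escape(marker)
    pattern = re.compile(
        # Lazy match from the marker up to the next marker at a line start, or the end of the text
        r"^" + escaped + r".*?(?=^" + escaped + r"|\Z)",
        flags=re.MULTILINE | re.DOTALL,
    )
    # Drop the newline that separates one block from the next
    return [match.group(0).rstrip("\n") for match in pattern.finditer(text)]


# Hypothetical text with two blocks
sample_text = "intro\nTabla1.A\nrow1\nTabla1.B\nrow2"

for part in split_tables(sample_text):
    print(part)
    print('---')
```

Compared with split(), this keeps the marker inside each part, so nothing has to be added back afterwards.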