My script can't read a number that is split across a line break

I'm trying to read the CNPJ, a number formatted like "30.114.117/0001-64", from a PDF file. Here's my script:

import re
import PyPDF2
import PySimpleGUI as sg


#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))


while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            pdf = PyPDF2.PdfFileReader(f)

            # Iterate over every page of the PDF
            lista = []
            for p in range(pdf.getNumPages()):

                # extract the text of the current page
                texto = pdf.getPage(p).extractText()

                # use a regex to look for CPF and CNPJ (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found in the PDF
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)

The script works, but the variable 'texto' may contain a line break in the middle of the number, like:

"your cnpj is 31.111.111
/0001-64"

When the line breaks, the regex can't find the number. I also tried texto = texto.replace("\n", " "), but the number still isn't found. Does anyone have an idea? Maybe there is another library that can read it.

I want to extract the CPF and CNPJ from the PDF, but the text breaks across lines and I can't extract the number.

CodePudding user response：

I recommend using PyMuPDF. Its text extraction accepts a number of flags, one of which detects hyphenation. Your problem should go away if you extract the text like this:

import fitz # PyMuPDF import
doc = fitz.open("your.file")
page = doc[0]  # page 0

text = page.get_text(flags=fitz.TEXT_DEHYPHENATE)

BTW, none of the above depends on the file being a PDF; it also works for XPS, EPUB, and more.

CodePudding user response：

You're using PyPDF2, which is deprecated. Please move to pypdf.

You haven't shared the PDF, so it's impossible to check whether that actually solves your issue. However, pypdf has had quite a lot of text extraction improvements compared to PyPDF2<2.0.0.

import re
import pypdf
import PySimpleGUI as sg


#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))


while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            reader = pypdf.PdfReader(f)

            # Iterate over every page of the PDF
            lista = []
            for page in reader.pages:

                # extract the text of the current page
                texto = page.extract_text()

                # use a regex to look for CPF and CNPJ (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found in the PDF
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)
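
If a line break still lands inside a number after switching libraries, one fallback is to make the regex itself tolerant of whitespace. Note that texto.replace("\n", " ") turns the break into a space, which the original pattern does not allow either. Below is a minimal sketch, assuming the break only ever falls next to a separator (".", "/" or "-") and never between two digits. find_ids is a hypothetical helper name, not part of any library:

```python
import re

# \s* allows optional whitespace (spaces, tabs, newlines) around each separator.
CNPJ_RE = re.compile(r'\d{2}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*/\s*\d{4}\s*-\s*\d{2}')
CPF_RE = re.compile(r'\d{3}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*-\s*\d{2}')


def find_ids(texto):
    """Return (cpfs, cnpjs) found in texto, with internal whitespace removed.

    Hypothetical helper for illustration only.
    """
    def strip_ws(match):
        # Drop the whitespace that the tolerant pattern let through.
        return re.sub(r'\s+', '', match)

    cnpjs = [strip_ws(m) for m in CNPJ_RE.findall(texto)]
    cpfs = [strip_ws(m) for m in CPF_RE.findall(texto)]
    return cpfs, cnpjs


# Sample text shaped like the broken output described in the question.
texto = "your cnpj is 31.111.111\n/0001-64 and cpf 123.456.789-10"
cpfs, cnpjs = find_ids(texto)
print(cpfs, cnpjs)
```

Allowing whitespace can produce false positives when unrelated numbers happen to sit next to each other, so it may be worth checking the results against a sample of your own PDFs.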