My script can't read a number that is split across a line break

Time:01-18

I'm trying to read the 'cnpj', which is a number like "30.114.117/0001-64", from a PDF file. Here's my script:

import re
import PyPDF2
import PySimpleGUI as sg


#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))


while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            pdf = PyPDF2.PdfFileReader(f)

            # Iterates by every page from PDF
            lista = []
            for p in range(pdf.getNumPages()):

                # extract the text of the current page
                texto = pdf.getPage(p).extractText()

                # use a regex to look for CPF and CNPJ (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found in the PDF
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)

The script works, but the variable 'texto' may contain text that is broken across lines, like:

"your cnpj is 31.111.111
/0001-64"

When the number is split by a line break, the regex can't find it. I also tried texto = texto.replace("\n", " "), but it still isn't found. Does anyone have an idea? Maybe another library that can read the text.

I want to extract the CPF and CNPJ from the PDF, but the text breaks across lines and I can't extract the number.

CodePudding user response:

I recommend using PyMuPDF. It has a number of flags used by its text extraction, among which is one to detect hyphenation. Your problem should go away if you extract like this with it:

import fitz # PyMuPDF import
doc = fitz.open("your.file")
page = doc[0]  # page 0

text = page.get_text(flags=fitz.TEXT_DEHYPHENATE)

BTW, none of the above depends on the file being a PDF; it also works for XPS, EPUB, and more.

CodePudding user response:

You're using PyPDF2, which is deprecated. Please move to pypdf.

You haven't shared the PDF, so it's impossible to check whether that actually solves your issue. However, pypdf has had quite a lot of text extraction improvements compared to PyPDF2<2.0.0.

import re
import pypdf
import PySimpleGUI as sg


#GUI Window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"),sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]

Janela = sg.Window("Extrator de CPF", Layout, margins=(100,50))


while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":

        # OPEN PDF File
        with open(valores["file_path"], 'rb') as f:

            # Create a PDF object
            reader = pypdf.PdfReader(f)

            # Iterates by every page from PDF
            lista = []
            for page in reader.pages:

                # extract the text of the current page
                texto = page.extract_text()

                # use a regex to look for CPF and CNPJ (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)

                # collect the CPF and CNPJ numbers found in the PDF
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        Janela['Output'].update(lista)
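
Whichever library extracts the text, the regexes themselves can also be made tolerant of line breaks. Note that replacing "\n" with " " cannot help with the original patterns: it turns "31.111.111\n/0001-64" into "31.111.111 /0001-64", and the space still sits between the digits and the slash. Below is a minimal sketch, assuming the break only ever falls next to a separator (".", "/" or "-") and never inside a group of digits. The names CNPJ_RE, CPF_RE and find_ids are made up for this example and are not part of any library.

```python
import re

# \s* permits any whitespace (spaces, tabs, newlines) around each separator.
CNPJ_RE = re.compile(
    r"(?<!\d)\d{2}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*/\s*\d{4}\s*-\s*\d{2}(?!\d)"
)
CPF_RE = re.compile(
    r"(?<!\d)\d{3}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*-\s*\d{2}(?!\d)"
)


def find_ids(texto):
    """Return (cpfs, cnpjs) found in texto, with inner whitespace removed."""

    def strip_ws(match):
        # Drop the whitespace the pattern tolerated, e.g. "111\n/0001" -> "111/0001".
        return re.sub(r"\s+", "", match)

    cpfs = [strip_ws(m) for m in CPF_RE.findall(texto)]
    cnpjs = [strip_ws(m) for m in CNPJ_RE.findall(texto)]
    return cpfs, cnpjs


if __name__ == "__main__":
    sample = "your cnpj is 31.111.111\n/0001-64 and your cpf is 123.456.\n789-10"
    print(find_ids(sample))
```

In the loop above, cpfs and cnpjs could then come from find_ids(texto) instead of the two re.findall calls. If the PDF breaks lines in the middle of a digit group, this sketch will still miss the number, and a different extraction method is the better fix.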