I'm trying to read the CNPJ, a number formatted like "30.114.117/0001-64", from a PDF file. Here's my script:
import re
import PyPDF2
import PySimpleGUI as sg

# GUI window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"), sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]
Janela = sg.Window("Extrator de CPF", Layout, margins=(100, 50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":
        # Open the PDF file
        with open(valores["file_path"], 'rb') as f:
            # Create a PDF reader object
            pdf = PyPDF2.PdfFileReader(f)
            lista = []
            # Iterate over every page of the PDF
            for p in range(pdf.getNumPages()):
                # Extract the text of the current page
                texto = pdf.getPage(p).extractText()
                # Use regexes to look for CPF and CNPJ numbers
                # (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)
                # Collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        # Show the collected numbers in the output field
        Janela['Output'].update(lista)
The script works, but the text in the variable 'texto' may contain a line break in the middle of a number, like:
"your cnpj is 31.111.111
/0001-64"
When the line breaks like this, the regex can't find the number. I also tried
texto = texto.replace("\n", " ")
but it still doesn't find it. Does anyone have an idea? Maybe another library that can read the text properly?
I want to extract the CPF and CNPJ numbers from a PDF, but the text breaks across lines and I can't extract them.
CodePudding user response:
I recommend using PyMuPDF. Its text extraction accepts a number of flags, one of which detects hyphenation. Your problem should go away if you extract the text like this:
import fitz # PyMuPDF import
doc = fitz.open("your.file")
page = doc[0] # page 0
text = page.get_text(flags=fitz.TEXT_DEHYPHENATE)
BTW, none of the above is specific to PDF files; it also works for XPS, EPUB, and more.
CodePudding user response:
You're using PyPDF2, which is deprecated. Please move to pypdf.
You haven't shared the PDF, so it's impossible to check whether that actually solves your issue. However, pypdf has had quite a lot of text extraction improvements compared to PyPDF2<2.0.0.
import re
import pypdf
import PySimpleGUI as sg

# GUI window
Layout = [
    [sg.Text("Por favor insira o diretório do seu PDF")],
    [sg.Input(key='file_path'), sg.FileBrowse("Procurar")],
    [sg.Button("Extrair"), sg.Button("Cancelar")],
    [sg.InputText("", key="Output")]]
Janela = sg.Window("Extrator de CPF", Layout, margins=(100, 50))

while True:
    evento, valores = Janela.read()
    if evento == sg.WIN_CLOSED or evento == "Cancelar":
        break
    elif evento == "Extrair":
        # Open the PDF file
        with open(valores["file_path"], 'rb') as f:
            # Create a PDF reader object
            reader = pypdf.PdfReader(f)
            lista = []
            # Iterate over every page of the PDF
            for page in reader.pages:
                # Extract the text of the current page
                texto = page.extract_text()
                # Use regexes to look for CPF and CNPJ numbers
                # (cpf = "123.456.789-10", cnpj = "30.114.117/0001-64")
                cnpjs = re.findall(r'\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}', texto)
                cpfs = re.findall(r'\d{3}\.\d{3}\.\d{3}-\d{2}', texto)
                # Collect the CPF and CNPJ numbers found on this page
                for cpf in cpfs:
                    lista.append(cpf + ',')
                for cnpj in cnpjs:
                    lista.append(cnpj + ',')
        # Show the collected numbers in the output field
        Janela['Output'].update(lista)
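If the extracted text still splits a number across lines whichever library you use, you can make the regex itself tolerate the break. Note that replacing "\n" with " " cannot help on its own: it turns "31.111.111\n/0001-64" into "31.111.111 /0001-64", and the original pattern does not allow that space either. Below is a minimal sketch using only the standard library. The names find_ids, CNPJ_RE and CPF_RE are made up for illustration, and it assumes the break only ever falls next to a separator (".", "/" or "-"), never between two digits of the same group.

```python
import re

# Optional whitespace (spaces, newlines) is allowed around every separator.
# The lookarounds keep a match from starting or ending inside a longer digit run.
CNPJ_RE = re.compile(
    r'(?<!\d)\d{2}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*/\s*\d{4}\s*-\s*\d{2}(?!\d)'
)
CPF_RE = re.compile(
    r'(?<!\d)\d{3}\s*\.\s*\d{3}\s*\.\s*\d{3}\s*-\s*\d{2}(?!\d)'
)


def _strip_whitespace(match_text):
    """Remove any whitespace captured inside a matched number."""
    return re.sub(r'\s+', '', match_text)


def find_ids(texto):
    """Return (cpfs, cnpjs) found in texto, normalized to the usual format."""
    cpfs = [_strip_whitespace(m) for m in CPF_RE.findall(texto)]
    cnpjs = [_strip_whitespace(m) for m in CNPJ_RE.findall(texto)]
    return cpfs, cnpjs


if __name__ == "__main__":
    sample = "your cnpj is 31.111.111\n/0001-64"
    print(find_ids(sample))
```

In the script above you would call find_ids(texto) in place of the two re.findall lines. If your PDF also breaks lines between digits of the same group, this sketch will not catch it and the patterns would need to be loosened further.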