regex extract text pdf

Time:08-22

I'm going crazy trying to extract an ID number for each person in a PDF file.

The situation: the PDF file lists a lot of people who received some money. I have to extract which ones received a given amount on a specific date.

I use the CPF, which looks like: 000.000.000-00

CPF is an identification document that has a unique number for each Brazilian person.

The code is OK, but when a person's name has more than 5 names, the CPF breaks across a line, like this:

234.234.234-

23

The people whose CPF contains this \n can't be found because the regex doesn't cover it. I tried everything and nothing works.

I'm using this regex: r"\d{3}[\.]\d{3}[\.]\d{3}[-](\s?\d{0,2})"

Edit 1:

I realized that the problem wasn't in the regex but in the text format returned by the function.

The text is being collected like: ' 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 '

The last 2 digits of the CPF show up at the end of the text string. I debugged the code, and it seems the line break in the PDF is causing all this trouble.

I changed the regex to find people by name, but there's no name pattern because the names are so different.

I'm thinking of some way to make a regex match: \d{3}[.]\d{3}[.]\d{3}[-]

then, after N characters, match:

'\s\d{2}\s' (' 12 ' from the example), because the last 2 digits always have these 2 blank spaces, one before and one after.

Is there some way I can do this? Please help.

CodePudding user response:

Specific to your 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 example:

(\d{3}\.\d{3}\.\d{3}-)[^\d]*?(\d{2})

It first matches and captures the 000.000.000- part: (\d{3}\.\d{3}\.\d{3}-)

Then matches but does not capture anything that's not digits: [^\d]*?

Then matches and captures two more digits: (\d{2})

Not the best implementation, since the result is returned in two separate groups that you have to join yourself, but I hope this helps.
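As a minimal sketch of how the two groups could be joined back into a single CPF in Python (the name `extract_cpf` is made up for illustration, and the sample string is the one from the question):

```python
import re

# Pattern from the answer: group 1 is the CPF up to the hyphen,
# group 2 is the next two-digit run after any non-digit text.
CPF_SPLIT = re.compile(r"(\d{3}\.\d{3}\.\d{3}-)[^\d]*?(\d{2})")


def extract_cpf(text):
    """Return the reassembled CPF, or None if the pattern does not match.

    Hypothetical helper, not part of any library.
    """
    match = CPF_SPLIT.search(text)
    if match is None:
        return None
    # Concatenate the two captured pieces into one string.
    return match.group(1) + match.group(2)


sample = " 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 "
print(extract_cpf(sample))  # 143.230.234-12
```

Because `[^\d]` also matches newlines, the same helper should handle the `234.234.234-` / `23` case where the digits are separated only by line breaks.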

CodePudding user response:

You keep replacing the question, which deletes the helpful comments. Many good suggestions should have worked by now, but you are not supplying a true minimal sample of input and output. Even so, the best comments so far should have got you past this:

((\d{3}\.){2}\d{3}-)\D*(\d{2})\b
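A rough sketch of applying a pattern of this shape to text with several records, assuming the first group is meant to cover both dotted triplets of the CPF. The helper name `extract_all_cpfs` and the first record in the sample text are invented for illustration:

```python
import re

# Group 1: the CPF up to the hyphen (two "ddd." runs, then "ddd-").
# \D* skips any non-digit text; group 2 is a two-digit run that must
# end at a word boundary, so "123" is not accepted as the tail.
CPF_ANY = re.compile(r"((?:\d{3}\.){2}\d{3}-)\D*(\d{2})\b")


def extract_all_cpfs(text):
    """Return every reassembled CPF found in text (hypothetical helper)."""
    return [head + tail for head, tail in CPF_ANY.findall(text)]


text = (
    "Maria Silva 111.222.333-44 Valor 10 "
    "Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 "
)
print(extract_all_cpfs(text))  # ['111.222.333-44', '143.230.234-12']
```

One caveat: if a record has no trailing digits at all, `\D*` can run on into the next record and pick up the wrong pair, so the result is worth checking against a few real pages.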
