I'm going crazy trying to extract an ID number for each person in a PDF file.
The situation: the PDF lists a lot of people who received some money, and I have to extract which ones received a given amount on a specific date.
I'm using the CPF as the ID, which looks like: 000.000.000-00
The CPF is an identification document that has a unique number for each Brazilian person.
The code works, but when a person's name has more than 5 names, the CPF breaks across a line, like this:
234.234.234-
23
The people whose CPF contains this \n can't be found because the regex doesn't cover it. I tried everything and nothing works.
I'm using this regex: r"\d{3}[\.]\d{3}[\.]\d{3}[-](\s?\d{0,2})"
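For reference, a minimal Python sketch of a pattern that tolerates a line break between the hyphen and the last two digits. It assumes the extracted text keeps those two digits right after the hyphen, separated only by whitespace. The names CPF_WITH_BREAK and find_cpfs and the sample text (including the people's names) are made up for illustration.

```python
import re

# \s* allows any whitespace (including \n) between the hyphen and the
# final two digits; \d{2} requires both digits to be present.
CPF_WITH_BREAK = re.compile(r"\d{3}\.\d{3}\.\d{3}-\s*\d{2}")


def find_cpfs(text):
    """Return every CPF found in text, with inner whitespace removed."""
    return [re.sub(r"\s+", "", match) for match in CPF_WITH_BREAK.findall(text)]


# Hypothetical sample: the first CPF is split across two lines.
sample = "Maria 234.234.234-\n23 100,00\nJoao 111.222.333-44 50,00"
print(find_cpfs(sample))  # ['234.234.234-23', '111.222.333-44']
```

Unlike \d{0,2}, which also accepts a CPF with the last digits missing, \d{2} only matches when both digits are found.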
Edit 1:
I realized that the problem isn't in the regex but in the text format returned by the function.
The text is being collected like this: ' 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 '
The last 2 digits of the CPF show up at the end of the text string. I debugged the code and it seems the line break in the PDF is causing all this trouble.
I tried changing the regex to find people by name, but there's no name pattern because the names are all so different.
I'm thinking of a regex that first matches: \d{3}[.]\d{3}[.]\d{3}[-]
and then, after N characters, matches:
'\s\d{2}\s'
(' 12 ' in the example), because the last 2 digits always have these 2 blank spaces, one before and one after.
Is there some way I can do this? Please help.
CodePudding user response:
Specific to your example, ' 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 ':
(\d{3}\.\d{3}\.\d{3}-)[^\d]*?(\d{2})
It first matches and captures the 000.000.000- part: (\d{3}\.\d{3}\.\d{3}-)
Then matches but does not capture anything that's not digits: [^\d]*?
Then matches and captures two more digits: (\d{2})
Not the best implementation, since the results are returned in two separate groups, but hope this helps.
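Since the two halves come back as separate groups, they can be glued back together in Python. This is a minimal sketch, assuming the text looks like the sample from the question. SPLIT_CPF and rejoin_cpf are hypothetical names, not part of any library.

```python
import re

# Group 1: the 000.000.000- prefix. Group 2: the two trailing digits.
# [^\d]*? lazily skips the non-digit text sitting between them.
SPLIT_CPF = re.compile(r"(\d{3}\.\d{3}\.\d{3}-)[^\d]*?(\d{2})")


def rejoin_cpf(text):
    """Return the CPF as one string, or None if no CPF prefix is found."""
    match = SPLIT_CPF.search(text)
    if match is None:
        return None
    return match.group(1) + match.group(2)


line = " 00,0 Benefício Saldo Conta Aldair Souza Lima) 143.230.234- Valor Mobilidade 12 "
print(rejoin_cpf(line))  # 143.230.234-12
```

One caveat: the pattern takes the first two digits that follow the prefix, so it assumes no other number (such as an amount) appears between the hyphen and the trailing digits.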
CodePudding user response:
You keep replacing the question, which deletes the helpful comments. Many of the good suggestions should have worked by now, but you are not supplying a true minimal sample of input and output. The best comments so far should have got you past