I need to get the CNPJ, but ignore the match if the string contains INTERESSADO. I need to do this using only regex; I can't use an if condition.
I thought that this pattern would work: (?!interessado: ).* CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\s?\/\d+-\s?\d+)
RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99
RECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO
TERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: TEST
INTERESSADO: TEST - CNPJ: 99.999.999/9999-99
In this case, my pattern matches all lines, but I need only the RECLAMADO and INTERESSADO CNPJs.
I'm using regex101 to test the patterns.
Note: I'm using the regex i and s flags.
CodePudding user response:
"get the CNPJ, but, ignore the match if string contains INTERESSADO"
That is, accept the lines with CNPJ and reject the lines with INTERESSADO.
There are several ways to accomplish this. One of them is to use lookahead and lookbehind assertions. In short, we need to reject INTERESSADO with (?!...) while processing each line. Here is a simple demo for your case.
Code:
import re

# Sample input from the question, one record per line.
text = """RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99
RECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO
TERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: TEST
INTERESSADO: TEST - CNPJ: 99.999.999/9999-99"""
lines = text.split("\n")
# Every character consumed before "cnpj" must be neither preceded nor followed by "interessado".
regex = r"^((?<!interessado).(?!interessado))*(cnpj)[^0-9./-]*(?P<cnpjvalue>[0-9./-]*)$"
ptn = re.compile(regex, re.I | re.S)  # re.I for the i flag; re.S for the s flag
for line in lines:
    m = ptn.match(line)
    if m:
        print("(Matched) cnpjvalue is " + m.group("cnpjvalue"))
    else:
        print("(Ignored) ... ")
Output:
(Matched) cnpjvalue is 99.999.999/9999-99
(Matched) cnpjvalue is 99.999.999/9999-99
(Ignored) ...
(Ignored) ...
(Ignored) ...
(Ignored) ...
(Ignored) ...
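To see why the lookahead from the question does not reject anything, the sketch below compares it with the pattern above on two of the INTERESSADO lines. It assumes the question's pattern was meant to have `+` quantifiers (`\W+`, `\d+`); the names `question_ptn`, `answer_ptn`, `question_starts`, and `answer_hits` are only illustrative. An unanchored `(?!interessado: )` is tested at a single position, so the engine is free to start the match wherever that test passes.

```python
import re

lines = [
    "TERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99",
    "INTERESSADO: TEST - CNPJ: 99.999.999/9999-99",
]

# Pattern from the question (quantifiers assumed to be "+"): the lookahead is
# checked only at the position where the match attempt starts.
question_ptn = re.compile(
    r"(?!interessado: ).* CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\s?/\d+-\s?\d+)",
    re.I | re.S,
)
# Pattern from this answer: the lookarounds are checked around every character
# consumed before "cnpj".
answer_ptn = re.compile(
    r"^((?<!interessado).(?!interessado))*(cnpj)[^0-9./-]*(?P<cnpjvalue>[0-9./-]*)$",
    re.I | re.S,
)

question_starts = []  # offset where the question's pattern starts matching
answer_hits = []      # whether this answer's pattern matches the line
for line in lines:
    m = question_ptn.search(line)
    question_starts.append(m.start() if m else None)
    answer_hits.append(answer_ptn.match(line) is not None)

print(question_starts)  # [0, 1]
print(answer_hits)      # [False, False]
```

On the first line the question's pattern matches from offset 0, because the text there is "TERCEIRO", not "interessado: ". On the second line the attempt at offset 0 is rejected, but the engine simply retries at offset 1 and succeeds.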
CodePudding user response:
You can use
import re
text = "RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99\nRECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: FOO FOO\nTERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: TEST\nINTERESSADO: TEST - CNPJ: 99.999.999/9999-99"
print(re.findall(r'^(?!.*interessado: ).* CNPJ\W+(\d+\.\d+\.\d+\s?\/\d+-\s?\d+)', text, re.M | re.I))
Output:
['99.999.999/9999-99', '99.999.999/9999-99']
Details:
^ - start of a line (due to re.M)
(?!.*interessado: ) - only go on matching if there is no "interessado:" followed by a space on the line
.* - any zero or more chars other than line break chars, as many as possible
CNPJ - a fixed string
\W+ - one or more non-word chars (may match across lines! If you do not need it, use [^\w\r\n]+)
(\d+\.\d+\.\d+\s?\/\d+-\s?\d+) - Group 1: your return value, one or more digits and . twice, then one or more digits, an optional whitespace, /, one or more digits, -, an optional whitespace, one or more digits.
Note that \s matches line breaks, so use [^\S\n] / [^\S\r\n] to match horizontal whitespace only.
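The caveat about `\W+` and `\s` crossing line breaks can be checked with a small sketch. The input below is hypothetical (it is not from the question): the CNPJ label and its number sit on two different lines. The names `loose` and `strict` are illustrative, and `strict` is just the pattern above with `\W` and `\s` swapped for the horizontal-only classes.

```python
import re

# Hypothetical input: the label and the number are split across two lines.
text = "RECLAMADO: FOO LTDA - CNPJ:\n99.999.999/9999-99"

# Original pattern: \W+ can consume the ":" and the line break after it.
loose = r'^(?!.*interessado: ).* CNPJ\W+(\d+\.\d+\.\d+\s?\/\d+-\s?\d+)'
# Variant restricted to a single line: [^\w\r\n] and [^\S\r\n] exclude line breaks.
strict = r'^(?!.*interessado: ).* CNPJ[^\w\r\n]+(\d+\.\d+\.\d+[^\S\r\n]?\/\d+-[^\S\r\n]?\d+)'

loose_matches = re.findall(loose, text, re.M | re.I)
strict_matches = re.findall(strict, text, re.M | re.I)

print(loose_matches)   # ['99.999.999/9999-99']
print(strict_matches)  # []
```

Which behaviour is right depends on the data: if a number may legitimately wrap onto the next line, the original pattern is the one to keep.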