Negate a match when the line contains a substring, using regex

I need to extract the CNPJ, but ignore the match if the string contains INTERESSADO. I need to do this using only regex; I can't use an if condition.

I thought this pattern would work: (?!interessado: ).* CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\s?\/\d+-\s?\d+)

RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99
RECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO
TERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: TEST
INTERESSADO: TEST - CNPJ: 99.999.999/9999-99

In this case, my pattern matches all lines, but I need only the RECLAMADO and `INTERESSADO` CNPJs.

I'm using regex101 to test the patterns.

Note: I'm using the regex i and s flags.

CodePudding user response：

get the CNPJ, but, ignore the match if string contains INTERESSADO

That is, accepting the lines with CNPJ, and rejecting the lines with INTERESSADO.

There are several ways to accomplish this. One of them is to use lookahead and lookbehind assertions.

In short, we need to reject INTERESSADO with (?!...) while processing each line. Here is a simple demo for your case.

Code:

import re

text = """RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99
RECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO
TERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99
TERCEIRO INTERESSADO: TEST
INTERESSADO: TEST - CNPJ: 99.999.999/9999-99"""

# Every character consumed before "cnpj" must be neither preceded
# nor followed by "interessado".
regex = r"^((?<!interessado).(?!interessado))*(cnpj)[^0-9./-]*(?P<cnpjvalue>[0-9./-]*)$"
ptn = re.compile(regex, re.I | re.S)  # re.I for the i flag; re.S for the s flag
for line in text.split("\n"):
    m = ptn.match(line)
    if m:
        print("(Matched) cnpjvalue is " + m.group("cnpjvalue"))
    else:
        print("(Ignored)   ... ")

Output:

(Matched) cnpjvalue is 99.999.999/9999-99
(Matched) cnpjvalue is 99.999.999/9999-99
(Ignored)   ... 
(Ignored)   ... 
(Ignored)   ... 
(Ignored)   ... 
(Ignored)   ...
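
As a side note on why the pattern from the question matched every line: a negative lookahead is zero-width and is only tested at the position where the match attempt starts. The short sketch below contrasts an unanchored lookahead with one anchored at the start of the line. The sample line, the variable names, and the simplified [\d./-]+ group are only illustrative assumptions, not part of the original patterns.

```python
import re

line = "TERCEIRO INTERESSADO: FOO - CNPJ: 99.999.999/9999-99"

# Unanchored: the lookahead is only checked where the match attempt starts.
# At index 0 the text ahead is "TERCEIRO ...", so the lookahead succeeds
# and .* is then free to run over "INTERESSADO: " on its way to " CNPJ".
unanchored = re.compile(r"(?!interessado: ).* CNPJ\W+(?P<cnpj>[\d./-]+)", re.I)

# Anchored: ^ pins the attempt to the line start, and the .* inside the
# lookahead scans the whole line for the forbidden word.
anchored = re.compile(r"^(?!.*interessado: ).* CNPJ\W+(?P<cnpj>[\d./-]+)", re.I)

unanchored_match = unanchored.search(line)
anchored_match = anchored.search(line)

print(unanchored_match is not None)  # True
print(anchored_match is not None)    # False
```

The tempered pattern in the answer above gets the same rejection by a different route: it checks the assertions around each consumed character rather than once at the line start.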

CodePudding user response：

You can use

import re
text = "RECLAMADO: FOO LTDA - CNPJ: 99.999.999/9999-99\nRECLAMADO: FOO FOO LTDA - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: FOO FOO\nTERCEIRO INTERESSADO: FOO FOO FOO FOO - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: FOO FOO IT'S A TEST - CNPJ: 99.999.999/9999-99\nTERCEIRO INTERESSADO: TEST\nINTERESSADO: TEST - CNPJ: 99.999.999/9999-99"
print(re.findall(r'^(?!.*interessado: ).* CNPJ\W+(\d+\.\d+\.\d+\s?\/\d+-\s?\d+)', text, re.M | re.I))

Output:

['99.999.999/9999-99', '99.999.999/9999-99']

Details:

^ - start of a line (due to re.M)
(?!.*interessado: ) - only go on matching if there is no "interessado: " (with a trailing space) anywhere on the line
.* - any zero or more chars other than line break chars, as many as possible
' ' - a space
CNPJ - a fixed string
\W+ - one or more non-word chars (may match across lines! If you do not need that, use [^\w\r\n]+)
(\d+\.\d+\.\d+\s?\/\d+-\s?\d+) - Group 1: your return value: one or more digits and a . twice, then one or more digits, an optional whitespace, /, one or more digits, -, an optional whitespace, one or more digits.
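
The breakdown above can also be written into the pattern itself with re.VERBOSE. The following is only a sketch: it assumes every "one or more" in the list corresponds to a + quantifier, uses a named group (cnpj) instead of Group 1, and runs on made-up sample numbers.

```python
import re

# In verbose mode unescaped whitespace is ignored, so the literal space
# is written as "[ ]" here.
CNPJ_LINE = re.compile(
    r"""
    ^                       # start of a line (because of re.M)
    (?!.*interessado:[ ])   # give up if "interessado: " occurs on the line
    .*[ ]CNPJ               # anything, then a space and the literal CNPJ
    \W+                     # one or more non-word chars
    (?P<cnpj>               # named group holding the return value
        \d+\.\d+\.\d+       # digits . digits . digits
        \s?/\d+             # optional whitespace, /, digits
        -\s?\d+             # -, optional whitespace, digits
    )
    """,
    re.M | re.I | re.X,
)

sample = (
    "RECLAMADO: FOO LTDA - CNPJ: 11.111.111/1111-11\n"
    "TERCEIRO INTERESSADO: FOO - CNPJ: 22.222.222/2222-22\n"
    "INTERESSADO: TEST - CNPJ: 33.333.333/3333-33"
)

found = [m.group("cnpj") for m in CNPJ_LINE.finditer(sample)]
print(found)  # ['11.111.111/1111-11']
```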

Note that \s matches line breaks, so use [^\S\n] / [^\S\r\n] to match horizontal whitespace only.
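
To illustrate the note about line breaks, here is a small sketch with a made-up input in which "CNPJ:" ends one line and a number starts the next. The names loose and strict are hypothetical; the character classes are the ones suggested above.

```python
import re

# Made-up input: "CNPJ:" ends one line and a number starts the next one.
text = "FOO - CNPJ:\n99.999.999/9999-99"

# \W+ also matches "\n", so the match crosses the line break.
loose = re.compile(r"CNPJ\W+(\d+\.\d+\.\d+\s?/\d+-\s?\d+)")

# [^\w\r\n]+ means "non-word chars except line breaks" and [^\S\r\n]? means
# "optional whitespace except line breaks", so the match stays on one line.
strict = re.compile(r"CNPJ[^\w\r\n]+(\d+\.\d+\.\d+[^\S\r\n]?/\d+-[^\S\r\n]?\d+)")

loose_result = loose.findall(text)
strict_result = strict.findall(text)

print(loose_result)   # ['99.999.999/9999-99']
print(strict_result)  # []
```

Whether the cross-line match is wanted depends on the input; if every record is known to sit on one line, the stricter classes are the safer choice.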