getting specific groups of patterns inside a block text

Time:10-21

I'm trying to extract these values, 10.547.889/0001-85 and 00.219.460/0001-05, as separate groups. The condition is that the pattern must start at executada(s):, so it can't be something like r' - CNPJ:? (?P<cnpj>\d+\.\d+\.\d+\/\d+-\d+)'. The idea is to start at executada(s) and capture these groups.

Currently, my pattern returns only one of the groups, and I don't know how to get all of them.

I'm using Python 3.8.5 and the regex library (not re).

import regex

text = """
Solicite-se ao BANCO CENTRAL, via protocolo digital - SISBACEN ,
o BLOQUEIO de créditos existentes até o limite de R$ 30.257,45 (trinta mil, duzentos e
cinquenta e sete reais e quarenta e cinco centavos) da(s) executada(s): J.HENRIQUE
GALVANI COMERCIO DE ROUPAS - ME - CNPJ 10.547.889/0001-85, Riane Confecções de
Roupas Ltda - ME - CNPJ: 00.219.460/0001-05, Jose Henrique Galvani - CPF: 234.846.406-34
e Heliane Leonel Raymundo Galvani - CPF: 813.460.347-53, porventura
existentes junto a instituições financeiras, incluindo cartões de crédito, agenciadores
de pagamento, administradores de consórcio."""

pattern = r'executad\w(?:\(s\))?\W+(?:[\p{L}\s\-\.]+CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\/\d+-\d+),)+'

for item in regex.finditer(pattern, text, flags=regex.I|regex.S):
    print(item.groupdict())

{'cnpj': '00.219.460/0001-05'}

I was expecting:

{'cnpj': '00.219.460/0001-05'}

{'cnpj': '10.547.889/0001-85'}

Can someone help me with this?

CodePudding user response:

Using the regex module, you could make use of the \G anchor:

(?:executad\w(?:\(s\))?\W+|\G(?!^)),?[\p{L}\s.-]+CNPJ\W+\K(?P<cnpj>\d+\.\d+\.\d+/\d+-\d+)

In parts, the pattern matches:

  • (?: Non-capturing group
    • executad\w Match executad followed by one word char (this could be a literal a if that is the only possibility)
    • (?:\(s\))?\W+ Optionally match (s), then 1+ non-word chars
    • | Or
    • \G(?!^) Assert the current position at the end of the previous match, but not at the start of the string
  • ) Close the non-capturing group
  • ,?[\p{L}\s.-]+ Match an optional , and then 1+ times any letter, whitespace char, . or -
  • CNPJ\W+ Match CNPJ and 1+ non-word chars
  • \K Clear the match buffer to forget what has been matched so far
  • (?P<cnpj>\d+\.\d+\.\d+/\d+-\d+) Named group cnpj, capturing the desired format
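To make the role of \G more concrete, here is a rough emulation of the same chaining with the standard-library re module, which has no \G, \K or \p{L}. It is only a sketch: the names START, ENTRY and cnpjs_after_anchor are made up for this illustration, and [^\W\d_] is used as a stand-in for \p{L}. Each entry has to begin exactly where the previous one ended, which is what \G enforces in the pattern above.

```python
import re

TEXT = (
    "cinquenta e sete reais e quarenta e cinco centavos) da(s) executada(s): J.HENRIQUE\n"
    "GALVANI COMERCIO DE ROUPAS - ME - CNPJ 10.547.889/0001-85, Riane Confecções de\n"
    "Roupas Ltda - ME - CNPJ: 00.219.460/0001-05, Jose Henrique Galvani - CPF: 234.846.406-34\n"
)

# Where the list of parties begins (first alternative of the regex pattern).
START = re.compile(r"executad\w(?:\(s\))?\W+", re.IGNORECASE)

# One list entry: optional comma, a name made of letters, whitespace, "." or "-",
# then the CNPJ itself. [^\W\d_] is the re spelling of "any letter".
ENTRY = re.compile(r",?(?:[^\W\d_]|[\s.-])+CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+/\d+-\d+)")


def cnpjs_after_anchor(text):
    """Collect CNPJs from consecutive entries right after executada(s)."""
    start = START.search(text)
    if start is None:
        return []
    pos = start.end()
    found = []
    while True:
        # match() with a position only succeeds exactly at pos, like \G.
        entry = ENTRY.match(text, pos)
        if entry is None:
            break  # the chain is broken, e.g. by the first CPF entry
        found.append(entry.group("cnpj"))
        pos = entry.end()
    return found


print(cnpjs_after_anchor(TEXT))
```

The loop stops at the first entry that is not a CNPJ, so a CNPJ appearing later in the text, after the CPF entries, would not be returned. The \G pattern behaves the same way.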


For the example data, you can omit the regex.S flag as \W also matches a newline.

import regex

pattern = r"(?:executad\w(?:\(s\))?\W+|\G(?!^)),?[\p{L}\s.-]+CNPJ\W+\K(?P<cnpj>\d+\.\d+\.\d+/\d+-\d+)"

text = ("Solicite-se ao BANCO CENTRAL, via protocolo digital - SISBACEN ,\n"
    "o BLOQUEIO de créditos existentes até o limite de R$ 30.257,45 (trinta mil, duzentos e\n"
    "cinquenta e sete reais e quarenta e cinco centavos) da(s) executada(s): J.HENRIQUE\n"
    "GALVANI COMERCIO DE ROUPAS - ME - CNPJ 10.547.889/0001-85, Riane Confecções de\n"
    "Roupas Ltda - ME - CNPJ: 00.219.460/0001-05, Jose Henrique Galvani - CPF: 234.846.406-34\n"
    "e Heliane Leonel Raymundo Galvani - CPF: 813.460.347-53, porventura\n"
    "existentes junto a instituições financeiras, incluindo cartões de crédito, agenciadores\n"
    "de pagamento, administradores de consórcio.")

for item in regex.finditer(pattern, text):
    print(item.groupdict())

Output

{'cnpj': '10.547.889/0001-85'}
{'cnpj': '00.219.460/0001-05'}
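As a side note on why the pattern in the question printed a single dict: it wraps (?P<cnpj>...) in a repeated group (?:...)+, so finditer finds one match covering both entries, and a capture group inside a repetition only keeps its last iteration. The small example below shows that behaviour with the standard-library re module on made-up input (the group name item and the sample string are just for illustration).

```python
import re

# The named group sits inside a repeated group, so each repetition
# overwrites the previous capture.
m = re.fullmatch(r"(?:(?P<item>\d+),?)+", "10,20,30")

print(m.group(0))       # the whole match: 10,20,30
print(m.group("item"))  # only the last repetition survives: 30
print(m.groupdict())
```

The regex module documents a captures() method on match objects that returns every repetition of a group, so calling item.captures("cnpj") on the match from the question's pattern may be another way to get both values without changing the pattern.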

CodePudding user response:

Check if this works for you:

import regex

text = """
Solicite-se ao BANCO CENTRAL, via protocolo digital - SISBACEN ,
o BLOQUEIO de créditos existentes até o limite de R$ 30.257,45 (trinta mil, duzentos e
cinquenta e sete reais e quarenta e cinco centavos) da(s) executada(s): J.HENRIQUE
GALVANI COMERCIO DE ROUPAS - ME - CNPJ 10.547.889/0001-85, Riane Confecções de
Roupas Ltda - ME - CNPJ: 00.219.460/0001-05, Jose Henrique Galvani - CPF: 234.846.406-34
e Heliane Leonel Raymundo Galvani - CPF: 813.460.347-53, porventura
existentes junto a instituições financeiras, incluindo cartões de crédito, agenciadores
de pagamento, administradores de consórcio."""

pattern = r'[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}'

# cut text to start right after executada(s)
text = text.split("executada(s)")[1]

cnpjs = [{"cnpj": cnpj} for cnpj in regex.findall(pattern, text)]

print(cnpjs)
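
One caveat with text.split("executada(s)")[1]: it raises IndexError when the marker is missing, and it only keeps the piece between the first and second occurrence of the marker. A slightly more defensive variant, sketched here with the standard-library re module and str.partition, could look like the following. The names CNPJ, cnpjs_after and marker are hypothetical, and the pattern is a stricter assumption than the one above because it requires the separators.

```python
import re

# Assumes the fully punctuated form NN.NNN.NNN/NNNN-NN.
CNPJ = re.compile(r"\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}")


def cnpjs_after(text, marker="executada(s)"):
    """Return [{'cnpj': ...}, ...] for every CNPJ after the first marker."""
    _, found, tail = text.partition(marker)
    if not found:  # marker absent: nothing to search
        return []
    return [{"cnpj": value} for value in CNPJ.findall(tail)]


sample = (
    "limite de R$ 30.257,45 da(s) executada(s): X - CNPJ 10.547.889/0001-85, "
    "Y - CNPJ: 00.219.460/0001-05, Z - CPF: 234.846.406-34"
)
print(cnpjs_after(sample))
print(cnpjs_after("text without the marker"))
```

Unlike the \G approach, this collects every CNPJ-shaped value anywhere after the marker, whether or not it belongs to the list that directly follows executada(s).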