Set alphanumeric regex pattern not accepting certain specific symbols

Time:12-15

import re

#Examples:
input_text = "Recien el 2021-10-12 despues de 3 dias 2021-10-12" #NOT PASS
input_text = "Recien el 2021-10-12 hsah555sahsdhj. Ya despues de 3 dias hjsdfhjdsfhjdsf 2021-10-12" #NOT PASS
input_text = "Recien el 2021-10-12 hsah555sahsdhj; despues de 3 dias hjsdfhjdsfhjdsf 2021-10-12" #NOT PASS
input_text = "Recien el 2021-10-12 hsah555sahsdhj despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12" #NOT PASS
input_text = "Recien el 2021-10-12 hsah555sahsdhj; mmm... creo que ya despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12" #PASS
input_text = "Recien el 2021-10-12 hsah555sahsdhj.    \n\n\n mmm... creo que ya despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12" #PASS


some_text = r"[\s|]*"  # <--- I NEED TO MODIFY THIS PATTERN
date_format = r"\d*-\d{2}-\d{2}"

check_00 = re.search(date_format + some_text + r"(?:(?:pasados|pasado|despues del|despues de el|despues de|despues|tras) (\d+) (?:días|día|dias|dia)|(\d+) (?:días|día|dias|dia) (?:pasados|pasado|despues del|despues de el|despues de|despues|tras))", input_text, re.IGNORECASE)
check_01 = re.search(r"(?:(?:pasados|pasado|despues del|despues de el|despues de|despues|tras) (\d+) (?:días|día|dias|dia)|(\d+) (?:días|día|dias|dia) (?:pasados|pasado|despues del|despues de el|despues de|despues|tras))" + some_text + date_format, input_text, re.IGNORECASE)

if not check_00 and not check_01: print("1")
else: print("0")

I need to set the variable some_text to a pattern that matches any alphanumeric substring, in uppercase or lowercase, that may also contain symbols (such as :, $, #, &, ?, ¿, !, ¡, |, °, a comma, ., (, ), ], [, }, {). The only sequences that must not be present, not even once, are ; and .\n or .[\s|]*\n*

In this case I need to detect the inputs that do NOT match, hence the if not conditionals in the code.

If everything in the algorithm works, the output should be:

0  #for example 1
0  #for example 2
0  #for example 3
0  #for example 4
1  #for example 5
1  #for example 6

Is it possible, within the same pattern that I want to place in the some_text variable, to indicate a list with the symbols that I do NOT want to appear in that identification area of the pattern (in this case ; and .[\s|]*\n* )?

CodePudding user response:

but the only symbols that should not be present, not even once, are ; and .\n or .[\s|]*\n*

To disallow ; you can simply use the negated character class [^;].

Regarding the other two "patterns": [\s|] rests on a wrong assumption, because a pipe symbol inside a character class is interpreted literally. It seems you meant it to make the \s optional, but the asterisk already ensures this. The dot must also be escaped, otherwise it matches any character, which gives \.\s*?\n. To disallow that sequence, put it in a negative look-ahead: (?!\.\s*?\n).
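A quick way to convince yourself of these points is to try each one in isolation. This is only a small sketch; the variable names and sample strings are made up for illustration.

```python
import re

# A pipe inside a character class is a literal "|", not an alternation.
pipe_is_literal = re.fullmatch(r"[\s|]", "|") is not None

# The asterisk alone already allows zero occurrences of \s.
star_allows_empty = re.fullmatch(r"\s*", "") is not None

# An unescaped dot matches any character except a newline, so ".\n" also matches "x\n".
unescaped_dot_matches_x = re.search(r".\n", "x\n") is not None
escaped_dot_matches_x = re.search(r"\.\n", "x\n") is not None

# The escaped form finds a period followed by optional whitespace and a newline.
period_newline_found = re.search(r"\.\s*?\n", "fin.   \n 2021-10-12") is not None

print(pipe_is_literal, star_allows_empty,
      unescaped_dot_matches_x, escaped_dot_matches_x, period_newline_found)
```
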

This leads to:

some_text = r"(?:(?!\.\s*?\n)[^;])*"
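
To check this against the six examples from the question, something like the following sketch could be used. It assumes the pieces in the question's code are joined with + and that the digit group is (\d+); the names day_phrase, classify and examples are introduced here only for illustration.

```python
import re

some_text = r"(?:(?!\.\s*?\n)[^;])*"
date_format = r"\d*-\d{2}-\d{2}"

# Assumed reconstruction of the day-count phrase from the question.
keywords = r"(?:pasados|pasado|despues del|despues de el|despues de|despues|tras)"
days = r"(?:días|día|dias|dia)"
day_phrase = rf"(?:{keywords} (\d+) {days}|(\d+) {days} {keywords})"


def classify(input_text):
    """Return "1" when neither ordering matches, "0" otherwise (as in the question)."""
    check_00 = re.search(date_format + some_text + day_phrase, input_text, re.IGNORECASE)
    check_01 = re.search(day_phrase + some_text + date_format, input_text, re.IGNORECASE)
    return "1" if not check_00 and not check_01 else "0"


examples = [
    "Recien el 2021-10-12 despues de 3 dias 2021-10-12",
    "Recien el 2021-10-12 hsah555sahsdhj. Ya despues de 3 dias hjsdfhjdsfhjdsf 2021-10-12",
    "Recien el 2021-10-12 hsah555sahsdhj; despues de 3 dias hjsdfhjdsfhjdsf 2021-10-12",
    "Recien el 2021-10-12 hsah555sahsdhj despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12",
    "Recien el 2021-10-12 hsah555sahsdhj; mmm... creo que ya despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12",
    "Recien el 2021-10-12 hsah555sahsdhj.    \n\n\n mmm... creo que ya despues de 3 dias hjsdfhjdsfhjdsf.\n 2021-10-12",
]

results = [classify(text) for text in examples]
print(results)
```

Each character between the date and the day-count phrase is consumed only if it is not a ; and does not begin a period-whitespace-newline sequence, so a single forbidden sequence anywhere in that stretch prevents the match.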