import re

def register_new_persons_names_to_identify_in_inputs(input_text):
    # Cases of compound human names:
    name_capture_pattern = r"(^[A-Z](?:\w+)\s*(?:del|de\s*el|de)\s*^[A-Z](?:\w+))?"
    regex_pattern = name_capture_pattern + r"\s*(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
    n0 = re.search(regex_pattern, input_text)  # case-sensitive
    if n0:
        word, = n0.groups()
        if word is None or word == "" or word == " ":
            print("I think there was a problem, and although I thought you were giving me a name, I couldn't interpret it!")
        else:
            print(repr(word))

input_text = "Creo que María del Pilar se trata de un nombre"  # example 1
input_text = "Estoy segura que María dEl Pilar se tRatA De uN nOmbre"  # example 2
input_text = "María del Carmen es un nombre viejo"  # example 3

register_new_persons_names_to_identify_in_inputs(input_text)
In Spanish, some compound names have a connector such as "del" in the middle. The connector is sometimes written in upper case, but more often it is left in lower case (even though it is part of a name).
Because my regex requires each part of the name to start with a capital letter, it fails and does not capture the person's name correctly. I think the error in my capture regex is in the sub-patterns for each part of the name, ^[A-Z](?:\w+)
I would also like to know whether there is a way to make only the connector options (?:del|de\s*el|de) case-insensitive, while the rest of the pattern stays case-sensitive. Something like (?i:del|de\s*el|de)?-i, but without affecting the capture group (which is the person's name).
This is the correct output that I need:
'María del Pilar' #for example 1
'María del Pilar' #for example 2
'María del Carmen' #for example 3
CodePudding user response:
A few things:
- remove the 2 ^
- add í to \w ([\wí], only added to the first but maybe needs to be added to the second too?)
- add E to del (d[eE]l, or make it case insensitive)

([A-Z](?:[\wí]+)\s*(?:d[Ee]l|de\s*el|de)\s*[A-Z](?:\w+))?

which I think can be further reduced to (remove the extra () groups):

([A-Z][\wí]+\s*(d[eE]l|de\s*el|de)\s*[A-Z]\w+)
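
To tie the suggestions together, here is a minimal sketch of how the corrected pattern might be used, under a few assumptions that are not in the original post. The helper names extract_compound_name and normalize_connector are hypothetical. The connector uses the scoped inline flag (?i:...) (available in Python 3.6+), so only that part is case-insensitive. It is kept non-capturing, because a second capturing group would make word, = n0.groups() fail with a ValueError. The sketch also uses \s+ rather than \s* around the connector so that it has to be a separate word. In Python 3, \w on str patterns already matches accented letters such as í, so [\wí] should not be strictly necessary.

```python
import re

# Name parts must start with an uppercase letter; only the connector is
# matched case-insensitively through the scoped inline flag (?i:...).
NAME_PATTERN = r"([A-Z]\w+\s+(?i:del|de\s*el|de)\s+[A-Z]\w+)"
TRIGGER_PATTERN = r"\s*(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
FULL_PATTERN = re.compile(NAME_PATTERN + TRIGGER_PATTERN)


def extract_compound_name(input_text):
    """Return the captured name exactly as written, or None if nothing matches."""
    match = FULL_PATTERN.search(input_text)
    return match.group(1) if match else None


def normalize_connector(name):
    """Lowercase only the connector between the two name parts (assumed extra step)."""
    return re.sub(r"\s+(?i:del|de\s*el|de)\s+", lambda m: m.group(0).lower(), name)


examples = [
    "Creo que María del Pilar se trata de un nombre",
    "Estoy segura que María dEl Pilar se tRatA De uN nOmbre",
    "María del Carmen es un nombre viejo",
]
for text in examples:
    name = extract_compound_name(text)
    if name is not None:
        print(repr(normalize_connector(name)))
```

Note that the capture group returns the text as it appears in the input, so example 2 yields 'María dEl Pilar' before normalization. The normalize_connector step is one possible way to get the 'María del Pilar' form shown in the expected output.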