Define a capture regex to recognize compound person names joined by a connector

import re

def register_new_persons_names_to_identify_in_inputs(input_text):

    #Cases of compound human names:
    name_capture_pattern = r"(^[A-Z](?:\w+)\s*(?:del|de\s*el|de)\s*^[A-Z](?:\w+))?"
    regex_pattern = name_capture_pattern + r"\s*(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

    n0 = re.search(regex_pattern, input_text)  # case-sensitive by default

    if n0:
        word, = n0.groups()
        if(word == None or word == "" or word == " "): print("I think there was a problem, and although I thought you were giving me a name, I couldn't interpret it!")
        else: print(repr(word))


input_text = "Creo que María del Pilar se trata de un nombre"   #example 1
input_text = "Estoy segura que María dEl Pilar se tRatA De uN nOmbre"   #example 2
input_text = "María del Carmen es un nombre viejo"    #example 3

register_new_persons_names_to_identify_in_inputs(input_text)

In Spanish, some compound names have a connector such as "del" in the middle. The connector is sometimes written in upper case, but it is usually left in lower case (even though it is part of a name).

When I define the regex so that each part of the name must start with a capital letter, it fails and does not capture the person's name correctly. I think the error in my capture regex is in the pattern for each part of the name, ^[A-Z](?:\w+)

I would also like to know whether there is a way to make only the connector options (?:del|de\s*el|de) match regardless of upper or lower case, while case still matters for the rest of the sentence. Something like (?i:del|de\s*el|de)?-i, but without affecting the capture group (which is the person's name)

This is the correct output that I need:

'María del Pilar'    #for example 1
'María del Pilar'    #for example 2
'María del Carmen'   #for example 3

CodePudding user response：

A few things:

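remove the two ^ anchors (^ only matches at the start of the string, so the name can never match in the middle of a sentence)
add í to \w ([\wí]; I only added it to the first part, but it may need to be added to the second too)
add E to del (d[eE]l, or make it case-insensitive)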

([A-Z](?:[\wí]+)\s*(?:d[Ee]l|de\s*el|de)\s*[A-Z](?:\w+))?

which I think can be further reduced to (remove ()):

([A-Z][\wí]+\s*(d[eE]l|de\s*el|de)\s*[A-Z]\w+)

https://regex101.com/r/bpKa12/1
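
For reference, here is a minimal Python sketch of how the fixes above might fit together. The names NAME_PART, CONNECTOR, SUFFIX and extract_compound_name are made up for illustration and are not from the question. It assumes Python 3.6+, where the scoped inline flag (?i:...) applies case-insensitivity only inside its own group, and where \w already matches accented letters such as í in str patterns, so [\wí] should not be needed there. Lowercasing the connector is also an assumption, based on the expected output for example 2.

```python
import re

# One name part: an ASCII uppercase letter followed by word characters.
# In Python 3, \w is Unicode-aware for str patterns, so it matches "í".
NAME_PART = r"[A-Z]\w+"

# Connector: the scoped flag (?i:...) makes only this group case-insensitive.
CONNECTOR = r"(?i:del|de\s*el|de)"

# Phrase that marks the preceding words as a name (also case-insensitive).
SUFFIX = r"(?i:se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"

# No ^ anchors: the name may appear anywhere in the sentence.
NAME_PATTERN = re.compile(
    rf"(?P<first>{NAME_PART})\s+(?P<connector>{CONNECTOR})\s+"
    rf"(?P<second>{NAME_PART})\s*{SUFFIX}"
)


def extract_compound_name(input_text):
    """Return the compound name with a lowercased connector, or None."""
    match = NAME_PATTERN.search(input_text)
    if match is None:
        return None
    # Normalize the connector: lowercase it and collapse inner whitespace.
    connector = " ".join(match.group("connector").lower().split())
    return f"{match.group('first')} {connector} {match.group('second')}"


examples = [
    "Creo que María del Pilar se trata de un nombre",
    "Estoy segura que María dEl Pilar se tRatA De uN nOmbre",
    "María del Carmen es un nombre viejo",
]
for text in examples:
    print(repr(extract_compound_name(text)))
```

Because the name parts sit outside the (?i:...) groups, they still have to start with an uppercase letter. Note that [A-Z] only covers ASCII capitals, so a name starting with an accented capital would need a wider character class.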