import re
word = ""
input_text = "Creo que July no se trata de un nombre" # example 1, should match Case 00
#input_text = "Creo que July Moore no se trata de un nombre" # example 2, should not match any case
#input_text = "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre" # example 3, should match Case 01
#input_text = "July Moore no se trata de un nombre" # example 4, should match Case 01
name_capture_pattern_00 = r"((?:\w+))?" # does not tolerate whitespace in the middle
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"(^[A-Z](?:\w\s*)+)" # tolerates spaces, but forces the name to begin with a capital letter
# Case 00
regex_pattern_00 = name_capture_pattern_00 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
# Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
# Search the string with each regex pattern (Case 00, then Case 01) and try to extract the substring of interest with the capturing group.
n0 = re.search(regex_pattern_00, input_text)
if n0 and word == "":
    word, = n0.groups()
    word = word.strip()
print(repr(word)) # --> print the substring captured by the capturing group
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()
print(repr(word)) # --> print the substring captured by the capturing group
If the pattern is preceded by a .\s* , a ,\s* , a ;\s* , or simply by the beginning of the input string, then use the capture pattern name_capture_pattern_01 = r"((?:\w\s*)+)?" ; if that is not the case, use the other capture pattern, name_capture_pattern_00 = r"((?:\w+))?"
I think that in Case 00 something like (?:(?<=\s)|^) should be added at the beginning of the pattern.
That way, after concatenation, there would be these 2 possible resulting patterns, which could perhaps be combined with an alternation | inside the search pattern:
In Case 01...
(?:\.|\;|\,) or the start of the string
((?:\w\s*)+)?
r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
In the other case (Case 00)...
((?:\w+))?
r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
In both cases (Case 00 or Case 01, depending on which one the program identifies) it should match the pattern, extract the capturing group, and store it in the variable word.
The correct output, that is, the capture group that should be obtained and printed, would be the following for each of these examples:
'July' #for the example 1
'' #for the example 2
'July Moore' #for the example 3
'July Moore' #for the example 4
EDIT CODE:
Although the regex patterns in this code appear to be well formed, it fails: the output is only the last part of the name, in this case "Moore", and not the full name "July Moore".
import re
# Here are 2 examples where you can see this "capture error" (uncomment one at a time)
#input_text = "HghD djkf ; July Moore no se trata de un nombre"
input_text = "July Moore no se trata de un nombre"
word = ""
#name_capture_pattern_01 = r"((?:\w\s*)+)"
name_capture_pattern_01 = r"([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*)"
# Case 01
regex_pattern_01 = r"(?:^|[.;,]\s*)" + name_capture_pattern_01 + r"\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de\s*un\s*nombre|se\s*trata\s*de\s*un\s*nombre|(?:ser[íi]a|es)\s*un\s*nombre)"
n1 = re.search(regex_pattern_01, input_text)
if n1 and word == "":
    word, = n1.groups()
    word = word.strip()
print(repr(word))
In both examples, since the name is preceded by something that satisfies (?:^|[.;,]\s*) and starts with a capital letter as required by the pattern ([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*) , the full name July Moore should be printed to the console. Curiously, with this pattern I cannot capture a complete name under the conditions established by the search pattern.
CodePudding user response:
If I understood correctly, you want to exclude cases where both of the following are true:
- The name consists of more than one word; AND
- The name does not occur at the start of a sentence
You could use just one regex and then inspect the match to decide whether the above condition occurs.
Here is a script I tested with:
import re
texts = [
    # Name is NOT at start of sentence, Name has SINGLE word:
    "Creo que July no se trata de un nombre",
    # Name is NOT at start of sentence, Name has MULTIPLE words:
    "Creo que July Moore no se trata de un nombre",
    # Name is at START of sentence, Name has MULTIPLE words:
    "Efectivamente esa es una lista de nombres. July Moore no se trata de un nombre",
    "July Moore Donald no se trata de un nombre",
    # Name is at START of sentence, Name has SINGLE word:
    "July no se trata de un nombre",
]
# The pattern does not depend on the input, so it is defined once outside the loop.
regex = r"(^|[.;,]\s*)?([A-Z][a-z]+(\s*[A-Z][a-z]+)*)\s*(?i:no)\s*(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))\s*un\s*nombre"
for input_text in texts:
    print("input:", input_text)
    for match in re.finditer(regex, input_text):
        word = ""
        # match[1] is not None => match is at start of a sentence.
        # match[3] is not None => match has name with more than one word.
        if match[1] is not None or not match[3]:
            word = match[2]
        print(" match:", repr(word) if word else "(no match)")
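If a single result per input is needed, as with the variable word in the question, the same check can be wrapped in a function. This is only a sketch: the names NAME_RE and extract_name are made up for illustration, and it assumes that the first accepted match is the one wanted. It uses the script's pattern, written with + quantifiers after [a-z] and split over several lines for readability.

```python
import re

# Pattern from the script, compiled once. NAME_RE is a hypothetical name.
NAME_RE = re.compile(
    r"(^|[.;,]\s*)?"                    # group 1: start of string or sentence punctuation, if present
    r"([A-Z][a-z]+(\s*[A-Z][a-z]+)*)"   # group 2: the whole name; group 3: last additional word, if any
    r"\s*(?i:no)\s*"
    r"(?i:se\s*tratar[íi]a\s*de|se\s*trata\s*de|(?:ser[íi]a|es))"
    r"\s*un\s*nombre"
)

def extract_name(text):
    """Return the first accepted name in text, or "" if none qualifies.

    A name is accepted when it starts a sentence or consists of one word.
    """
    for match in NAME_RE.finditer(text):
        at_sentence_start = match[1] is not None
        single_word = not match[3]
        if at_sentence_start or single_word:
            return match[2]
    return ""

if __name__ == "__main__":
    print(repr(extract_name("Creo que July no se trata de un nombre")))
    print(repr(extract_name("Creo que July Moore no se trata de un nombre")))
```

Returning "" when nothing qualifies mirrors the empty string the question expects for its example 2.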
Notes:
- I used finditer as in theory there might be more than one match in an input string.
- The use of \s* instead of \s is odd, but in comments you indicated that this is intended as you want to capture cases where some space separation is left out.
- Names can look more complex than just [A-Z][a-z]+ . Some names include hyphens, apostrophes or other characters, not to mention letters from other alphabets. The letter following a hyphen might be upper or lower case... etc.
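As a rough illustration of the last note, one way to loosen the name pattern is to accept any Unicode letter and to allow hyphens or apostrophes between letter runs. This is a sketch under assumptions, not a complete definition of what a name is: LETTER, NAME_WORD and looks_like_name_word are hypothetical names, and because the re module has no "uppercase letter" class, the capital-letter check is done in Python with str.isupper().

```python
import re

# In a str pattern, \w matches Unicode word characters, so "a word character
# that is neither a digit nor an underscore" approximates "any letter".
LETTER = r"[^\W\d_]"

# One name word: letter runs joined by a single apostrophe or hyphen,
# e.g. O'Neil, Jean-Luc, Núñez. The letter after the separator may be
# upper or lower case.
NAME_WORD = rf"{LETTER}+(?:['’-]{LETTER}+)*"
NAME_WORD_RE = re.compile(NAME_WORD)

def looks_like_name_word(word):
    """Return True if word has the shape of a capitalized name word."""
    return bool(NAME_WORD_RE.fullmatch(word)) and word[0].isupper()

if __name__ == "__main__":
    for candidate in ["July", "O'Neil", "Jean-Luc", "Núñez", "july", "R2D2"]:
        print(candidate, looks_like_name_word(candidate))
```

To use it in the search pattern, NAME_WORD would take the place of [A-Z][a-z]+ , with the capital-letter test applied to each captured word afterwards.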