I need a regex that extracts every name in a sentence. For this purpose, a name is any word that starts with a capital letter and is preceded by certain connectors within the sentence. The extraction must follow the pattern described below and must also capture the text before and after each name, so that this context can be printed next to the name it belongs to.
This is the pseudo-regex pattern that I need:
the beginning of the input sentence or (,|;|.|y)
associated_sense_1: "some character string (alphanumeric)" or "nothing"
(con |juntos a |junto a |en compania de )
identified_person: "some word that starts with a capital letter (the name that I must extract)"; it ends when the regex finds one or more spaces
associated_sense_2: "some character string (alphanumeric)" or "nothing"
the end of the input sentence or (,|;|.|y |con |juntos a |junto a |en compania de )
The (,|;|.|y) tokens are just connectors between people that are used to build the regex pattern. They carry no information beyond indicating which sequence a name belongs to, so they can be removed afterwards with a .replace( , "").
With this regex I need to extract these 3 string groups:
associated_sense_1
identified_person
associated_sense_2
associated_sense = associated_sense_1 + " " + associated_sense_2
This is the proto-code:
import re
#Example 1
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
#Example 2
#sense = "Adrian ya esta en la parada; y alli probablemente esten Lucy y May en la parada esperandonos"
person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[A-Z][^A-Z]*"
#person_identify_pattern = r"\s*(con |por |, y |, |,y |y )\s*[^A-Z]*"
for identified_person in re.split(person_identify_pattern, sense):
    identified_person = identified_person.strip()
    if identified_person:
        try:
            print(f"Write '{associated_sense}' to {identified_person}.txt")
        except NameError:
            # First non-empty piece: associated_sense is not defined yet, so store it
            associated_sense = identified_person
This is the wrong output I get:
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to con.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Melisa.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to ,.txt
Write 'puede ser peligroso ir solas, quizas sea mejor ir' to Lucy en la parada.txt
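One factor that may explain the stray "con" and "," entries: when the pattern passed to re.split contains a capturing group, the text matched by that group is returned as part of the result list. The small sketch below uses a made-up sentence (an assumption for illustration, not the input from the question) to compare a capturing group with a non-capturing one.

```python
import re

# Made-up sample sentence, used only to illustrate re.split behaviour.
sample = "ir con Adrian y Lucy"

# Capturing group: the connectors themselves appear in the result.
with_capture = re.split(r"(con |y )", sample)

# Non-capturing group: the connectors are dropped from the result.
without_capture = re.split(r"(?:con |y )", sample)

print(with_capture)     # ['ir ', 'con ', 'Adrian ', 'y ', 'Lucy']
print(without_capture)  # ['ir ', 'Adrian ', 'Lucy']
```

Note also that anything the split pattern consumes outside the group, such as the name matched by [A-Z][^A-Z]*, is discarded, which would explain why some names never reach the output.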
Correct output for example 1:
Write 'quizas sea mejor ir con' to Adrian.txt
Write 'y seguro que luego podemos esperar por en la parada' to Melisa.txt
Write 'y seguro que luego podemos esperar por en la parada' to Marcos.txt
Write 'y seguro que luego podemos esperar por en la parada' to Lucy.txt
Correct output for example 2:
Write 'ya esta en la parada' to Adrian.txt
Write 'alli probablemente esten en la parada esperandonos' to Lucy.txt
Write 'alli probablemente esten en la parada esperandonos' to May.txt
I also tried this other regex, but I still have problems with this code:
import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
person_identify_pattern = r"\s*(?:,|;|.|y |con |juntos a |junto a |en compania de |)\s*((?:\w\s*) )\s*(?<=con|por|a, | y )\s*([A-Z].*?\b)\s*((?:\w\s*) )\s*(?:,|;|.|y |con |juntos a |junto a |en compania de )\s*"
for m in re.split(person_identify_pattern, sense):
    m = m.strip()
    if m:
        try:
            print(f"Write '{content}' to {m}.txt")
        except NameError:
            # First non-empty piece: content is not defined yet, so store it
            content = m
But I keep getting this wrong output:
Write 'puede ser peligroso ir solas' to quizas sea mejor ir con Adrian y seguro que luego podemos esperar por.txt
Write 'puede ser peligroso ir solas' to Melisa,.txt
Write 'puede ser peligroso ir solas' to Marcos y Lucy en la parad.txt
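A likely contributor to this output: inside an alternation such as (?:,|;|.|y ), an unescaped . matches any character, not only a full stop. The short check below, on a made-up two-letter input, compares the unescaped form with a character class in which the dot is literal.

```python
import re

# Unescaped "." in the alternation matches every character.
unescaped = re.findall(r"(?:,|;|.)", "a,b")

# Inside a character class the dot is a literal full stop.
escaped = re.findall(r"[,;.]", "a,b")

print(unescaped)  # ['a', ',', 'b']
print(escaped)    # [',']
```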
CodePudding user response:
import re
sense = "puede ser peligroso ir solas, quizas sea mejor ir con Adrian y seguro que luego podemos esperar por Melisa, Marcos y Lucy en la parada"
if match := re.findall(r"(?<=con|por|a, | y )\s*([A-Z].*?\b)", sense):
    print(match)
The result is ['Adrian', 'Melisa', 'Marcos', 'Lucy']
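To also recover the text around each name, one possible approach is to locate runs of names first and then cut the surrounding text at the nearest delimiter. This is only a sketch under stated assumptions: the helper extract_people and the constants NAME, NAME_RUN and DELIM are hypothetical names introduced here, every capitalised word is treated as a name, and a standalone "y" is always treated as a delimiter and dropped. With that last choice the sketch reproduces the expected output for example 2 and the Adrian line of example 1, but for Melisa, Marcos and Lucy it yields 'seguro que luego podemos esperar por en la parada', without the leading "y" shown in the question.

```python
import re

# Assumption: a name is any capitalised word, as defined in the question.
NAME = r"\b[A-Z]\w*\b"
# A run of names joined by ",", "y" or ", y", e.g. "Melisa, Marcos y Lucy".
NAME_RUN = re.compile(rf"{NAME}(?:\s*(?:,\s*y|,|y)\s+{NAME})*")
# Clause delimiters: punctuation or the standalone word "y".
DELIM = re.compile(r"[,;.]|\by\b")


def extract_people(sentence):
    """Return (name, associated_sense) pairs (hypothetical helper)."""
    runs = list(NAME_RUN.finditer(sentence))
    pairs = []
    for i, run in enumerate(runs):
        # Never look past the neighbouring runs of names.
        lo = runs[i - 1].end() if i > 0 else 0
        hi = runs[i + 1].start() if i + 1 < len(runs) else len(sentence)
        # Text between the closest delimiter on each side and the run.
        before = DELIM.split(sentence[lo:run.start()])[-1]
        after = DELIM.split(sentence[run.end():hi])[0]
        # Join both halves and normalise whitespace.
        sense = " ".join(f"{before} {after}".split())
        for name in re.findall(NAME, run.group()):
            pairs.append((name, sense))
    return pairs


sense = "Adrian ya esta en la parada; y alli probablemente esten Lucy y May en la parada esperandonos"
for name, associated_sense in extract_people(sense):
    print(f"Write '{associated_sense}' to {name}.txt")
```

Because a capitalised first word of a sentence would also count as a name here, real input may need an extra check on the preceding connector (con, junto a, and so on).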