import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
# example 2: almost the same as example 1, but some of the names are already encapsulated
# in the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
for i in result_list:
    # NOTE: as written, this pattern only matches names that are already wrapped in ((PERS)...)
    input_text = re.sub(r"\(\(PERS\)(" + re.escape(i) + r")\)",
                        lambda m: f"((PERS){m[1]})",
                        input_text)
print(repr(input_text)) # --> output
Note that a name must meet certain conditions to be identified: it must sit between whitespace (\s*the searched name\s*), or be at the beginning (?:(?<=\s)|^) and/or at the end of the input string.
A name may also be followed by a comma, for example "Ada White, Melissa and Louis went shopping", or a space may be accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason, it is important that a name can also be found directly after any of [.,;].
Cases where the names should NOT be encapsulated would be, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
In these cases the name is followed or preceded by other characters that are not part of the name being searched for.
For examples 1 and 2 (example 2 is the same as example 1, except that some names are already encapsulated and must not be encapsulated again), the output should be:
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
CodePudding user response:
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character, or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
# (?<!\(PERS\)) skips names that directly follow an existing (PERS) tag;
# (?!['\w)-]) skips names followed by ', a word character, ) or -
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
print(input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
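As one possible refinement, here is a minimal sketch that adds a left-hand boundary check and escapes the names before joining them. The helper names build_pattern and tag_names are made up for illustration, and the extra lookbehind (?<![\w-]) is my own assumption about what counts as "part of another word" on the left side (for example, it would leave "McThomas" or "Jean-Louis" untouched); adjust it if your data needs something different.

```python
import re

# Same name list as in the question.
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur',
               'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis',
               'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']


def build_pattern(names):
    """Hypothetical helper: compile one pattern for all names."""
    # Longest names first, so "Thomas Edd" is tried before "Thomas".
    ordered = sorted(names, key=len, reverse=True)
    # re.escape guards against names containing regex metacharacters.
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(
        r"(?<!\(PERS\))"       # not directly after an existing (PERS) tag
        r"(?<![\w-])"          # assumption: not glued to a preceding word or hyphen
        rf"({alternatives})"
        r"(?!['\w)-])"         # not followed by ', a word character, ) or -
    )


def tag_names(text, names=result_list):
    """Hypothetical helper: wrap every qualifying name as ((PERS)name)."""
    return build_pattern(names).sub(r"((PERS)\1)", text)


print(tag_names("Ada White,Melissa and Louis went shopping"))
# ((PERS)Ada White),((PERS)Melissa) and ((PERS)Louis) went shopping
```

Because names that are already tagged are skipped, running tag_names a second time on its own output should leave the text unchanged.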