Regular expression to match honorific title followed by full name-CodePudding

Given a file 'funeral.txt' write a regular expression for people names that is triggered by honorifics (Mr, Ms, Senator, etc).

I have so far written the code

import re
file = open("funeral.txt", "r")
text = file.read()
file.close()
honorific_names = "([M][a-z\.]{2,4})\s ([A-Z][a-z]*[\.\-\']*)*"
re.findall(honorific_names, text)

This gives me the output as

[('Mrs.', 'Meghan'), ('Mrs.', 'Oprah'), ('Mr.', 'Boris'), ('Ms.', 'Megan')]

However what I am looking for is:

['Mrs. Meghan', 'Mrs. Oprah Winfrey', 'Mr. Boris Johnson', 'Ms. Megan Specia']

How exactly should I modify the pattern ?

Some text from funeral.txt

The service also came just weeks after he and his wife, Mrs. Meghan, the Duchess of Sussex, gave a bombshell interview to Mrs. Oprah Winfrey in which they laid bare their problems with the royal family

According to Mr. Boris Johnson, the whole of United Kingdom are sadden by the passing of Prince Philip.  

— Ms. Megan Specia

CodePudding user response：

Apply " ".join on each pair to make one string

[" ".join(x) for x in [('Mrs.', 'Meghan'), ('Mrs.', 'Oprah'), ('Mr.', 'Boris'), ('Ms.', 'Megan')]]
['Mrs. Meghan', 'Mrs. Oprah', 'Mr. Boris', 'Ms. Megan']

import re

with open("funeral.txt", "r") as file:
    text = file.read()
result = [" ".join(x) for x in re.findall(r"([M][a-z.]{2,4})\s ([A-Z][a-z]*[.\-\']*)*", text)]

CodePudding user response：

honorific_names = r'((?:Mrs|Mr|Ms|Senator|Sr|Dr)\.?\s(?:[A-Z][a-z] )(?:\s[A-Z][a-z] )*)'

I think this would at least check out on your sample text, should be easy enough to expand on for your extended need. It's not the cleanest solution maybe, one could compact the Mr/Ms/Mrs but for readability it makes sense I'd say.