Replace elements in list with substring contained in elements

Given this list of names

names = ['Anna, Mrs. James (Nadia Elf)', 'Martin, Mr. Michael ', 'Barese, Mr. Alfred',
         'Amalfi, Mrs. Abby (J E V)', 'Manuel, Dr. Louis']

I am trying to replace each name with just its title.

For instance, the first name is 'Anna, Mrs. James (Nadia Elf)' and I want to replace it with just 'Mrs'.

The code below works for a single name, but to replace every name with its title I would have to repeat these lines

titles = []
for i in names:
    new_string = i.replace("Anna, Mrs. James (Nadia Elf)", "Mrs")
    titles.append(new_string)
print(titles)

for every single item in the list. How can I write code that handles this in no more than a few lines?

CodePudding user response：

Assuming that a person's title is always the first word after the comma, you can chain a few .split() calls, as shown in my first approach. If that might not hold, you can instead use a regular expression that looks, in the part after the comma, for a multi-character string ending in a dot and followed by whitespace, as shown in my second approach:

import re

names = [
    'Anna, Mrs. James (Nadia Elf)',
    'Martin, Mr. Michael',
    'Barese, Mr. Alfred',
    'Amalfi, Mrs. Abby (J E V)',
    'Manuel, Dr. Louis'
]

regex = r'.+\,\s(.+\.)\s'
preg = re.compile(regex)


def split_name(s):
    # assuming that the title is the first element after the comma
    return s.split(',')[-1].strip().split(' ', 1)[0]


def regex_name(s):
    # more flexible: assumes the title ends with a dot and the name contains no other dots
    m = preg.match(s)
    if m:
        return m.group(1)
    return None


# use first approach based on split
titles = [split_name(name) for name in names]
print(titles)


# use second approach based on regex
titles = [regex_name(name) for name in names]
print(titles)

Printed output:

['Mrs.', 'Mr.', 'Mr.', 'Mrs.', 'Dr.']
['Mrs.', 'Mr.', 'Mr.', 'Mrs.', 'Dr.']
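
The question asks for 'Mrs' rather than 'Mrs.', so here is a minimal sketch that drops the trailing dot. It assumes the title is a single word that directly follows a comma and ends with a period. The names `TITLE_AFTER_COMMA` and `extract_title` are made up for this example, and returning None when nothing matches is just one possible choice.

```python
import re

# Hypothetical pattern: a word that directly follows a comma (plus optional
# whitespace) and is terminated by a period. Only the word is captured.
TITLE_AFTER_COMMA = re.compile(r',\s*(\w+)\.')


def extract_title(name):
    """Return the title without its trailing period, or None if not found."""
    m = TITLE_AFTER_COMMA.search(name)
    return m.group(1) if m else None


names = [
    'Anna, Mrs. James (Nadia Elf)',
    'Martin, Mr. Michael ',
    'Barese, Mr. Alfred',
    'Amalfi, Mrs. Abby (J E V)',
    'Manuel, Dr. Louis',
]

titles = [extract_title(name) for name in names]
print(titles)
# ['Mrs', 'Mr', 'Mr', 'Mrs', 'Dr']
```

Capturing only the word, and leaving the period outside the group, avoids a separate step to strip it afterwards.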

CodePudding user response：

Here's a regex approach that should handle any title ending in a period. The one thing to watch for is other words immediately followed by a period, for example at the end of a sentence.

To get around that, you could restrict the regex to match only words of 2-3 letters before a period.

import re

names = ['Anna, Mrs. James (Nadia Elf)',
         'Martin, Mr. Michael ',
         'Barese, Mr. Alfred',
         'Amalfi, Mrs. Abby (J E V)',
         'Manuel, Dr. Louis']

titles = [re.sub(r'.*\b(\w+\.)(?:$|\s).*', r'\1', name) for name in names]

print(titles)
# ['Mrs.', 'Mr.', 'Mr.', 'Mrs.', 'Dr.']

Edit: here is a stricter regex that you could also use: .*\b([A-Z]\w{1,2}\.)(?:$|\s).*

This matches an uppercase letter followed by 1-2 word characters and then a period. For example, it matches titles like Mrs. but not mrs.
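
To see how the stricter pattern behaves, here is a small sketch. The sample strings other than the first are invented for illustration, and `STRICT` is just a name chosen here. Note that re.sub returns its input unchanged when the pattern does not match, so an unmatched name comes back whole rather than as an empty string.

```python
import re

# The stricter pattern quoted above: an uppercase letter, 1-2 more word
# characters, then a period followed by whitespace or the end of the string.
STRICT = re.compile(r'.*\b([A-Z]\w{1,2}\.)(?:$|\s).*')

samples = [
    'Anna, Mrs. James (Nadia Elf)',  # 'Mrs.' matches
    'Amalfi, mrs. Abby',             # lowercase title: no match at all
    'Smith, Mr. John L. Jones',      # one-letter initial 'L.' is skipped
]

for s in samples:
    # Replace the whole string with the captured title, if there is one
    print(repr(STRICT.sub(r'\1', s)))
# 'Mrs.'
# 'Amalfi, mrs. Abby'
# 'Mr.'
```

Because the pattern only allows 2-3 characters before the period, a longer abbreviation such as Prof. would not match either.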

As mentioned in the comments, this can behave unexpectedly for some inputs. For example, given a string like "Joe Smith, Jr." it will determine the title as Jr., which is probably not what we want.

There are also some other edge cases in the inputs that the regex above does not currently account for and that might be worth considering:

multiple titles
middle initials (the "stricter" variation of the regex above should cover these, since a one-letter initial like L. is not matched)
variations like Miss, Lord, Sir, Madam etc.
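
One way to handle these edge cases is to match against an explicit list of titles instead of guessing from periods. This is only a sketch: `KNOWN_TITLES` is an assumed, non-exhaustive list, `find_titles` is a hypothetical helper, and the choice to look only after the first comma assumes the "Surname, Title Given" layout used in the question.

```python
import re

# Assumed, non-exhaustive list of titles; extend it to fit your data.
KNOWN_TITLES = ['Mrs', 'Mr', 'Miss', 'Ms', 'Dr', 'Sir', 'Lord', 'Madam']

# Whole-word match on any known title, with an optional trailing period.
TITLE_RE = re.compile(r'\b(' + '|'.join(KNOWN_TITLES) + r')\b\.?')


def find_titles(name):
    """Return every known title found in the name, in order, without periods."""
    # Look only at the part after the first comma, so a surname such as
    # 'Lord' is not mistaken for a title; without a comma, scan everything.
    head, sep, tail = name.partition(',')
    return TITLE_RE.findall(tail if sep else head)


print(find_titles('Anna, Mrs. James (Nadia Elf)'))  # ['Mrs']
print(find_titles('Smith, Dr. Sir John'))           # ['Dr', 'Sir']
print(find_titles('Doe, Miss Jane'))                # ['Miss']
print(find_titles('Joe Smith, Jr.'))                # []
```

Returning a list leaves it to the caller to decide what to do with multiple titles or with none.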

CodePudding user response：

If you only have two options, 'Mr' and 'Mrs', just check which one is in the string. Check for 'Mrs' first, because 'Mr' is also a substring of 'Mrs'.

['Mrs' if 'Mrs' in name else 'Mr' for name in names]

Edit: I noticed you also have other titles such as 'Dr'. In that case, first create a dict or even an Enum of all the titles you want to handle, and then follow the example above.
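
A rough sketch of the Enum idea follows. The `Title` enum and the `title_of` helper are hypothetical names, and the set of members is an assumption to be extended as needed. As a variation on the plain substring check, this looks for the title together with its period, which avoids confusing 'Mr' with 'Mrs' and avoids matching surnames that merely start with the same letters.

```python
from enum import Enum


class Title(Enum):
    # Assumed set of titles; add members for any others in your data.
    MRS = 'Mrs'
    MR = 'Mr'
    DR = 'Dr'


def title_of(name):
    """Return the first Title whose text plus a period appears in the name."""
    for title in Title:  # Enum members iterate in definition order
        if f'{title.value}.' in name:
            return title
    return None  # no known title found


names = ['Anna, Mrs. James (Nadia Elf)', 'Martin, Mr. Michael ',
         'Barese, Mr. Alfred', 'Amalfi, Mrs. Abby (J E V)',
         'Manuel, Dr. Louis']

print([title_of(name).value for name in names])
# ['Mrs', 'Mr', 'Mr', 'Mrs', 'Dr']
```

The list comprehension at the end assumes every name has a known title; for real data you would want to handle a None result first.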

CodePudding user response：

Besides using a regular expression, you can use a list comprehension with str.split:

names=['Anna, Mrs. James (Nadia Elf)', 'Martin, Mr. Michael ', 'Barese, Mr. Alfred', 'Amalfi, Mrs. Abby (J E V)', 'Manuel, Dr. Louis']
titles = [name.split(',')[1].split('.')[0].strip() for name in names]
print(titles)

Printed output:

['Mrs', 'Mr', 'Mr', 'Mrs', 'Dr']