Get elements (salutation, title, last name and first name) from string with varying number of elements

I am still at the beginning of my Python journey and need your help with the following task:

After web scraping for contact details, I get a string for the company's CEO. This string contains the salutation, title, last name and first name of the CEO. I would like to split this string into the corresponding elements (salutation, title, last name and first name). My problem is that the elements vary greatly, for example:

with or without salutation
different titles
one or more first names
one or more surnames

The order of the elements is always the same. There is also always only one last name.

#some examples for the string:
ceo1 = "Herr Dr. Mustermann Max" #salutation, title, last name and first name
ceo2 = "Müller Monika" #just last name and first name
ceo3 = "Frau Mustermann Iris Petra" #salutation, last name and 2x first name
ceo4 = "Herr Mag. Dr. Schubert Franz Peter" #salutation, 2x title, last name and 2x first name
ceo5 = "Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus" #salutation, 2x title (one without a dot at the end), last name and 2x first name

ceo = ceo2

#get salutation:
salutation_list = ["Herr", "Frau"]
salutation_test = bool(sum(map(lambda x: x in ceo, salutation_list)))

if salutation_test is True:
    salutation = ceo[0:4]
    ceo_without_salutation = ceo[5:]
else:
    salutation = "N/A"
    ceo_without_salutation = ceo

#get title:
title_list = ["Dr.", "Mag. Dr.", "BA", "Dipl.-Ing."]
title_test = bool(sum(map(lambda x: x in ceo_without_salutation, title_list)))

if title_test is True:
    title = "titel" #How can I extract the corresponding element from the list and eliminate it from string 'ceo_without_salutation'?
    ceo_without_title = "ceo_without_salutation - titel"
else:
    title = "N/A"
    ceo_without_title = ceo_without_salutation

name_list = ceo_without_title.split(" ") 

#get lastname
lastname = name_list[0]

#get firstname
del name_list[0] 
firstname = " ".join(name_list) #join with a space so multiple first names stay separated

Most important question: how can I extract the title? And beyond that, is there a better way than mine to solve the issue? Thanks a lot for your help!

CodePudding user response：

If the title is always followed by a dot ("."), you can use a regular expression with a positive lookahead. Here is a sample function that takes a text and extracts the title:

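import re

def get_title(text: str):
    # \w+ matches a run of word characters; (?=\.) requires a dot right after it
    # without including the dot in the match
    result = re.search(r'\w+(?=\.)', text)
    if result:
        return result.group()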
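
To see what a lookahead pattern of the form `\w+(?=\.)` yields on the sample strings from the question, here is a small sketch. `re.search` stops at the first match and the lookahead leaves the dot out, so only one title comes back and it has no trailing dot. The second function, `get_dotted_titles`, is a hypothetical helper that is not part of the answer above; it assumes that every token ending in a dot is a title, which would also catch an abbreviated name such as "J.", and it cannot find titles without a dot such as "BA".

```python
import re

def first_word_before_dot(text):
    """Lookahead variant: first run of word characters followed by a dot."""
    match = re.search(r"\w+(?=\.)", text)
    return match.group() if match else None

def get_dotted_titles(text):
    """Hypothetical helper: every whitespace-delimited token ending in a dot."""
    # \S+\. matches a token up to and including its final dot;
    # (?=\s|$) requires that the token ends there.
    return re.findall(r"\S+\.(?=\s|$)", text)

sample = "Herr Mag. Dr. Schubert Franz Peter"
print(first_word_before_dot(sample))  # Mag
print(get_dotted_titles(sample))      # ['Mag.', 'Dr.']
```

For "Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus" the helper returns only `['Dipl.-Ing.']`, so a lookup list is still needed for dotless titles.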

CodePudding user response：

OK, I now have at least a solution that works. If someone knows a more Pythonic way, I would be very happy to learn from it. Thanks a lot!

#some ceo-string examples:
ceo1 = "Herr Dr. Mustermann Max" #salutation, title, last name and first name
ceo2 = "Müller Monika" #just last name and first name
ceo3 = "Frau Mustermann Iris Petra" #salutation, last name and 2x first name
ceo4 = "Herr Mag. Dr. Schubert Franz Peter" #salutation, 2x title, last name and 2x first name

ceo = ceo4

#get salutation:
salutation_list = ["Herr", "Frau"]
salutation_test = bool(sum(map(lambda x: x in ceo, salutation_list)))

if salutation_test is True:
    salutation = ceo[0:4]
    ceo_without_salutation = ceo[5:]
else:
    salutation = "N/A"
    ceo_without_salutation = ceo

#get title:
title_list = ["Dr.", "Mag. Dr.", "Dipl.-Ing.", "BA", "MAS"]
title_test = bool(sum(map(lambda x: x in ceo_without_salutation, title_list)))

titles_ceo = []
index_max = []

if title_test is True:
    for i in title_list:
        x = ceo_without_salutation.find(i)
        if x >= 0:
            y = ceo_without_salutation.find(" ", x)
            z = ceo_without_salutation[x:y]
            index_max.append(y)
            titles_ceo.append(z)
        else:
            continue
    
    title = " ".join(titles_ceo)
    
    j = max(index_max)
    k = len(ceo_without_salutation)
    
    ceo_without_title = ceo_without_salutation[j + 1:k]

else:
    title = "N/A"
    ceo_without_title = ceo_without_salutation

#get firstname and lastname:
m = ceo_without_title.find(" ")
lastname = ceo_without_title[0:m]
firstname = ceo_without_title[m + 1:]


print(salutation)
print(title)
print(firstname)
print(lastname)
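
One possibly more Pythonic alternative is to split the string into tokens and consume them from the left, relying on the fixed order stated in the question (salutation, titles, one last name, first names). The sketch below is only an illustration under assumptions: the names `SALUTATIONS`, `TITLES`, `CeoName` and `parse_ceo` are made up for this example, the title set has to list every single title token that can occur, and a last name that happens to equal a title token would be misread. Because titles are taken in the order they appear, "Mag. Dr." stays "Mag. Dr.".

```python
from typing import NamedTuple

# Assumed vocabularies; extend them to match the scraped data.
SALUTATIONS = {"Herr", "Frau"}
TITLES = {"Dr.", "Mag.", "Dipl.-Ing.", "BA", "MAS"}

class CeoName(NamedTuple):
    salutation: str
    title: str
    lastname: str
    firstname: str

def parse_ceo(ceo: str) -> CeoName:
    tokens = ceo.split()

    # Optional salutation: only if the very first token is a known one.
    salutation = tokens.pop(0) if tokens and tokens[0] in SALUTATIONS else "N/A"

    # Zero or more titles: consume leading tokens while they are known titles.
    titles = []
    while tokens and tokens[0] in TITLES:
        titles.append(tokens.pop(0))

    # Exactly one last name, then everything else is first names.
    lastname = tokens[0] if tokens else "N/A"
    firstname = " ".join(tokens[1:]) or "N/A"

    return CeoName(salutation, " ".join(titles) or "N/A", lastname, firstname)

print(parse_ceo("Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus"))
```

Comparing whole tokens instead of testing substrings also avoids treating a name that merely contains "Herr" or "BA" as a salutation or title.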