I am still at the beginning of my Python journey and need your help with the following task:
After web scraping for contact details, I get a string for the company's CEO. This string contains the salutation, title, last name and first name of the CEO. I would like to split this string into the corresponding elements (salutation, title, last name and first name). My problem is that the elements vary greatly, for example:
- with or without salutation
- different titles
- one or more first names
- one or more surnames
The order of the elements is always the same. There is also always only one last name.
# some examples for the string:
ceo1 = "Herr Dr. Mustermann Max"  # salutation, title, last name and first name
ceo2 = "Müller Monika"  # just last name and first name
ceo3 = "Frau Mustermann Iris Petra"  # salutation, last name and 2x first name
ceo4 = "Herr Mag. Dr. Schubert Franz Peter"  # salutation, 2x title, last name and 2x first name
ceo5 = "Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus"  # salutation, 2x title (one without a dot at the end), last name and 2x first name
ceo = ceo2
# get salutation:
salutation_list = ["Herr", "Frau"]
salutation_test = bool(sum(map(lambda x: x in ceo, salutation_list)))
if salutation_test is True:
    salutation = ceo[0:4]
    ceo_without_salutation = ceo[5:]
else:
    salutation = "N/A"
    ceo_without_salutation = ceo
# get title:
title_list = ["Dr.", "Mag. Dr.", "BA", "Dipl.-Ing."]
title_test = bool(sum(map(lambda x: x in ceo_without_salutation, title_list)))
if title_test is True:
    title = "titel"  # How can I extract the corresponding element from the list and eliminate it from string 'ceo_without_salutation'?
    ceo_without_title = "ceo_without_salutation - titel"
else:
    title = "N/A"
    ceo_without_title = ceo_without_salutation
name_list = ceo_without_title.split(" ")
# get last name
lastname = name_list[0]
# get first name(s)
del name_list[0]
firstname = " ".join(name_list)  # join with a space so several first names stay separated
Most important question: how can I extract the title? And beyond that, is there a better way than mine to solve the issue? Thanks a lot for your help!
CodePudding user response:
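If the title is always followed by a dot ("."), you can use a regular expression with a positive lookahead. Here is a sample function that takes the text and extracts the first title, without its trailing dot; it returns None if no dot-terminated word is found:
import re

def get_title(text: str):
    # \w+ matches a run of word characters; (?=\.) requires a dot right after it without consuming it
    result = re.search(r'\w+(?=\.)', text)
    if result:
        return result.group()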
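To make the behaviour of the lookahead concrete, here is a minimal sketch run against the sample strings from the question. It assumes the intended pattern is `\w+(?=\.)`; the helper names `first_dotted_word` and `dotted_tokens` are made up for this illustration. The second helper is a variation, not part of the answer above: it collects every whitespace-delimited token that ends with a dot, so it keeps the dot and also handles "Dipl.-Ing.". Neither helper can detect titles without a dot, such as "BA".

```python
import re


def first_dotted_word(text):
    """Return the first run of word characters directly followed by a dot, or None."""
    match = re.search(r"\w+(?=\.)", text)
    return match.group() if match else None


def dotted_tokens(text):
    """Return every whitespace-delimited token that ends with a dot."""
    # \S+\. is a run of non-space characters ending in a dot;
    # (?=\s|$) requires whitespace or the end of the string right after it.
    return re.findall(r"\S+\.(?=\s|$)", text)


print(first_dotted_word("Herr Dr. Mustermann Max"))  # Dr
print(first_dotted_word("Müller Monika"))  # None
print(dotted_tokens("Herr Mag. Dr. Schubert Franz Peter"))  # ['Mag.', 'Dr.']
print(dotted_tokens("Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus"))  # ['Dipl.-Ing.']
```

Because "BA" is missed in the last example, a dot-based pattern alone probably is not enough for the data described in the question; a list of known titles is still needed for those cases.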
CodePudding user response:
OK, I now have at least one solution that works. If someone knows a more Pythonic way, I would be very happy to learn from it. Thanks a lot!
# some ceo-string examples:
ceo1 = "Herr Dr. Mustermann Max"  # salutation, title, last name and first name
ceo2 = "Müller Monika"  # just last name and first name
ceo3 = "Frau Mustermann Iris Petra"  # salutation, last name and 2x first name
ceo4 = "Herr Mag. Dr. Schubert Franz Peter"  # salutation, 2x title, last name and 2x first name
ceo = ceo4
# get salutation:
salutation_list = ["Herr", "Frau"]
salutation_test = bool(sum(map(lambda x: x in ceo, salutation_list)))
if salutation_test is True:
    salutation = ceo[0:4]
    ceo_without_salutation = ceo[5:]
else:
    salutation = "N/A"
    ceo_without_salutation = ceo
# get title:
title_list = ["Dr.", "Mag. Dr.", "Dipl.-Ing.", "BA", "MAS"]
title_test = bool(sum(map(lambda x: x in ceo_without_salutation, title_list)))
titles_ceo = []
index_max = []
if title_test is True:
    for i in title_list:
        x = ceo_without_salutation.find(i)
        if x >= 0:
            y = ceo_without_salutation.find(" ", x)
            z = ceo_without_salutation[x:y]
            index_max.append(y)
            titles_ceo.append(z)
        else:
            continue
    title = " ".join(titles_ceo)
    j = max(index_max)
    k = len(ceo_without_salutation)
    ceo_without_title = ceo_without_salutation[j + 1:k]
else:
    title = "N/A"
    ceo_without_title = ceo_without_salutation
# get firstname and lastname:
m = ceo_without_title.find(" ")
lastname = ceo_without_title[0:m]
firstname = ceo_without_title[m + 1:]
print(salutation)
print(title)
print(firstname)
print(lastname)
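As one possible more Pythonic variant of the above, here is a minimal sketch that splits the string into tokens and consumes them from the left. It relies on the fixed order stated in the question (salutation, titles, one last name, first names). The function name `split_ceo` and the sets `SALUTATIONS` and `TITLES` are assumptions made for this illustration, and the title set has to be extended by hand with whatever titles actually occur in the scraped data.

```python
SALUTATIONS = {"Herr", "Frau"}  # assumed to be the only salutations
TITLES = {"Dr.", "Mag.", "Dipl.-Ing.", "BA", "MAS"}  # assumed list, extend as needed


def split_ceo(ceo):
    """Split 'salutation titles lastname firstnames' into its parts."""
    tokens = ceo.split()

    # Optional salutation: only accepted as the very first token.
    salutation = "N/A"
    if tokens and tokens[0] in SALUTATIONS:
        salutation = tokens.pop(0)

    # Zero or more titles: consume tokens while they are known titles.
    titles = []
    while tokens and tokens[0] in TITLES:
        titles.append(tokens.pop(0))

    # Exactly one last name, then everything else is first names.
    lastname = tokens[0] if tokens else "N/A"
    firstname = " ".join(tokens[1:]) or "N/A"

    return {
        "salutation": salutation,
        "title": " ".join(titles) or "N/A",
        "lastname": lastname,
        "firstname": firstname,
    }


print(split_ceo("Herr Dipl.-Ing. BA Mozart Wolfgang Amadeus"))
print(split_ceo("Müller Monika"))
```

Matching whole tokens instead of substrings keeps the titles in their original order and avoids false hits inside names. The trade-off is that a last name identical to a listed title (for example "BA") would be consumed as a title.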