I have a field in a dataframe with names as follows:
Joseph Sam Smith
Angela Savage
James Taylor
William Smith Jr
I want to split it into four columns, first_name, middle_name, last_name, suffix. For this dataset it's probably ok (though not ideal) to assume the only possible suffix is Jr.
I've got the split assuming only first and last, but then I realized I need more than that.
df[['first_name','last_name']] = df['name'].str.split(" ", n=1, expand=True)
Thanks in advance!
CodePudding user response:
Not a vectorised approach, but it gets the job done. There is an assumption that each person has a minimum of first and last name, i.e. no "Cher" or "Prince Jr".
setup
import pandas as pd

data = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])
suffixes = ["Jr", "III"]
solution
def decipher(name):
    l = [None]*4 # placeholder list
    tokens = name.split()
    l[0] = tokens.pop(0) # first name
    if tokens[-1] in suffixes:
        l[-1] = tokens.pop() # add suffix to end of list
    l[2] = tokens.pop() # last element of tokens must be last name
    if len(tokens) > 0: # if there are any elements left they are a middle name
        l[1] = tokens.pop()
    return pd.Series(l)
result = data.apply(decipher)
result
is
         0     1       2     3
0   Joseph   Sam   Smith  None
1   Angela  None  Savage  None
2    James  None  Taylor  None
3  William  None   Smith    Jr
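The result above has columns labelled 0 to 3. Here is a minimal sketch of one way to attach the column names the question asks for and join them back onto the original frame. It uses the same token-popping idea as `decipher`; `split_name`, `COLUMNS` and `SUFFIXES` are names invented here for illustration, and treating every leftover token as part of the middle name is an assumption, not something the answer states.

```python
import pandas as pd

# Hypothetical names for illustration; the column labels come from the question.
COLUMNS = ["first_name", "middle_name", "last_name", "suffix"]
SUFFIXES = {"Jr", "III"}

def split_name(name):
    """Split one name into a labelled Series of four parts (missing parts are None)."""
    tokens = name.split()
    first = tokens.pop(0)
    # Only pop a suffix if a token is left and it is a known suffix.
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else None
    last = tokens.pop() if tokens else None
    # Assumption: whatever remains is the middle name, possibly several words.
    middle = " ".join(tokens) or None
    return pd.Series([first, middle, last, suffix], index=COLUMNS)

df = pd.DataFrame({"name": ["Joseph Sam Smith", "Angela Savage",
                            "James Taylor", "William Smith Jr"]})
parts = df["name"].apply(split_name)  # one row of four named columns per name
df = df.join(parts)                   # aligned on the shared index
```

Because each pop is guarded, this variant should not raise on a single-token name, although such a name would simply end up with only `first_name` filled.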
CodePudding user response:
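name = "Joseph Sam Smith"
df = [["first_name", "middle_name", "last_name", "suffix"]]  # header row of a plain list of lists
nameLis = name.split(" ")
if len(nameLis) == 3:
    # three tokens are treated as first, middle, last; pad the suffix
    nameLis.append("")
elif len(nameLis) == 2:
    # two tokens are first and last; pad the middle name and the suffix
    nameLis.insert(1, "")
    nameLis.insert(3, "")
df.append(nameLis)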
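The snippet above handles one hard-coded name and treats any three-token name as first, middle, last, so "William Smith Jr" would end up with "Jr" as the last name. Here is a minimal sketch that extends the same padding idea to a list of names and adds a suffix check. `pad_name` and `SUFFIXES` are hypothetical names, and it assumes every name has at least a first and a last name.

```python
SUFFIXES = {"Jr"}  # assumption: the only suffix present in this dataset

def pad_name(name):
    """Return [first, middle, last, suffix], using "" for missing parts.

    Assumes at least two tokens (first and last name).
    """
    tokens = name.split()
    # Only treat the final token as a suffix if first and last names remain.
    suffix = tokens.pop() if len(tokens) > 2 and tokens[-1] in SUFFIXES else ""
    first, last = tokens[0], tokens[-1]
    middle = " ".join(tokens[1:-1])  # empty string when there is no middle name
    return [first, middle, last, suffix]

rows = [["first_name", "middle_name", "last_name", "suffix"]]  # header row
for name in ["Joseph Sam Smith", "Angela Savage", "James Taylor", "William Smith Jr"]:
    rows.append(pad_name(name))
```

The list of lists keeps the header as its first row, matching the layout used above; `pd.DataFrame(rows[1:], columns=rows[0])` would turn it into a dataframe if needed.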
CodePudding user response:
>>> import pandas as pd
>>> x = pd.Series(["Joseph Sam Smith","Angela Savage", "James Taylor", "William Smith Jr"])
>>> x
0    Joseph Sam Smith
1       Angela Savage
2        James Taylor
3    William Smith Jr
dtype: object
>>> d = x.str.split(expand=True)
>>> d['suffix'] = None
>>> d.columns = ['FirstName', 'MiddleName', 'LastName', 'suffix']
>>> matched = d.loc[d.LastName.eq("Jr")]
>>> d.iloc[matched.index, 3] = d.iloc[matched.index, 2].to_list()
>>> d.iloc[matched.index, 2] = None
>>> d
  FirstName MiddleName LastName suffix
0    Joseph        Sam    Smith   None
1    Angela     Savage     None   None
2     James     Taylor     None   None
3   William      Smith     None     Jr
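In the output above, the surname of every two-token name (and of "William Smith Jr") sits in MiddleName, because `str.split(expand=True)` fills the columns from left to right. Here is a sketch of one way to stay vectorised while avoiding that, using `Series.str.extract` with named groups. The pattern and column names are assumptions made for illustration; it also assumes "Jr" is the only suffix and that every name has at least a first and a last name.

```python
import pandas as pd

names = pd.Series(["Joseph Sam Smith", "Angela Savage",
                   "James Taylor", "William Smith Jr"])

# Named groups become column names. The middle-name group is lazily optional
# ("??"), so the regex first tries to match without a middle name and only
# falls back to one when the remaining tokens cannot be last name (+ suffix).
pattern = (
    r"^(?P<first_name>\S+)"
    r"(?:\s+(?P<middle_name>.+?))??"
    r"\s+(?P<last_name>\S+?)"
    r"(?:\s+(?P<suffix>Jr))?$"
)

parts = names.str.extract(pattern)  # groups that did not participate become NaN
```

Missing parts come back as NaN rather than None, and a name that does not match the pattern at all (for example a single token) would yield a row of NaN values.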