How do I split a string in a dataframe based on the number of spaces

I have a field in a dataframe with names as follows:

Joseph Sam Smith
Angela Savage
James Taylor
William Smith Jr

I want to split it into four columns, first_name, middle_name, last_name, suffix. For this dataset it's probably ok (though not ideal) to assume the only possible suffix is Jr.

I've got the split assuming only first and last, but then I realized I need more than that.

df[['first_name','last_name']] = df['name'].str.split(" ", n=1, expand=True)

Thanks in advance!

CodePudding user response：

Not a vectorised approach, but it gets the job done. There is an assumption that each person has a minimum of first and last name, i.e. no "Cher" or "Prince Jr".

Setup:

import pandas as pd

data = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])

suffixes = ["Jr", "III"]

Solution:

def decipher(name):
    parts = [None] * 4  # placeholders: first, middle, last, suffix
    tokens = name.split()
    parts[0] = tokens.pop(0)  # first name
    if tokens[-1] in suffixes:
        parts[3] = tokens.pop()  # trailing token is a known suffix
    parts[2] = tokens.pop()  # last remaining token must be the last name
    if tokens:  # anything still left is treated as a middle name
        parts[1] = tokens.pop()
    return pd.Series(parts)

result = data.apply(decipher)

The result is:

         0     1       2     3
0   Joseph   Sam   Smith  None
1   Angela  None  Savage  None
2    James  None  Taylor  None
3  William  None   Smith    Jr
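If a vectorised version is preferred, one option is a regular expression with named groups passed to `str.extract`. This is only a sketch: the pattern and the `suffix_alt` helper are my own, the suffix list is assumed to be the same one as above, and it handles at most one middle name. Names that do not fit the pattern come back as all-NaN rows.

```python
import re

import pandas as pd

names = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])

# Assumed suffix list, mirroring the one used in the answer above.
suffixes = ["Jr", "III"]
suffix_alt = "|".join(re.escape(s) for s in suffixes)

# first [middle] last [suffix]; the lookahead stops a trailing suffix
# from being captured as the last name.
pattern = (
    r"^(?P<first_name>\S+)"
    r"(?:\s+(?P<middle_name>\S+))?"
    rf"\s+(?P<last_name>(?!(?:{suffix_alt})$)\S+)"
    rf"(?:\s+(?P<suffix>{suffix_alt}))?$"
)

# One column per named group; optional groups that did not match are NaN.
parts = names.str.extract(pattern)
print(parts)
```

The column names come straight from the group names, so the result can be joined back onto the original dataframe if needed.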

CodePudding user response：

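name = "Joseph Sam Smith"
# plain list of rows (not a DataFrame); the first row is the header
df = [["first_name", "middle_name", "last_name", "suffix"]]
nameLis = name.split(" ")
if len(nameLis) == 3:
    if nameLis[-1] == "Jr":
        nameLis.insert(1, "")  # first, last, suffix: no middle name
    else:
        nameLis.append("")  # first, middle, last: no suffix
elif len(nameLis) == 2:
    nameLis.insert(1, "")  # no middle name
    nameLis.insert(3, "")  # no suffix

df.append(nameLis)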
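To apply this kind of per-name logic to a whole dataframe column, something along these lines might work. It is a rough sketch: `split_name` and `COLUMNS` are names I made up, "Jr" is assumed to be the only suffix, and missing parts are filled with empty strings as in the snippet above.

```python
import pandas as pd

COLUMNS = ["first_name", "middle_name", "last_name", "suffix"]


def split_name(name, suffixes=("Jr",)):
    """Return [first, middle, last, suffix]; missing parts are ""."""
    parts = name.split()
    if not parts:
        return ["", "", "", ""]
    # Treat the final token as a suffix only if a first and last name remain.
    suffix = parts.pop() if len(parts) > 2 and parts[-1] in suffixes else ""
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])  # everything between first and last
    return [first, middle, last, suffix]


df = pd.DataFrame({"name": [
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
]})

# Build the four new columns row by row, then attach them by index.
split_cols = pd.DataFrame(
    [split_name(n) for n in df["name"]],
    columns=COLUMNS,
    index=df.index,
)
df = df.join(split_cols)
print(df)
```

This is still a Python-level loop rather than a vectorised operation, which is usually fine for a column of names.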

CodePudding user response：

>>> import pandas as pd
>>> x = pd.Series(["Joseph Sam Smith","Angela Savage", "James Taylor", "William Smith Jr"])
>>> x

0    Joseph Sam Smith
1       Angela Savage
2        James Taylor
3    William Smith Jr
dtype: object

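>>> d = x.str.split(expand=True)
>>> d['suffix'] = None
>>> d.columns = ['FirstName', 'MiddleName', 'LastName', 'suffix']
>>> # move a trailing "Jr" out of LastName and into suffix
>>> matched = d.loc[d.LastName.eq("Jr")]
>>> d.iloc[matched.index, 3] = d.iloc[matched.index, 2].to_list()
>>> d.iloc[matched.index, 2] = None
>>> # rows left without a LastName still hold it in MiddleName; shift it over
>>> no_last = d.LastName.isna()
>>> d.loc[no_last, 'LastName'] = d.loc[no_last, 'MiddleName']
>>> d.loc[no_last, 'MiddleName'] = None
>>> d

    FirstName   MiddleName  LastName    suffix
0   Joseph      Sam         Smith       None
1   Angela      None        Savage      None
2   James       None        Taylor      None
3   William     None        Smith       Jr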
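Since the question is really about the number of tokens in each name, another possibility is to count them per row and pick each part by position. The following is a sketch under a few assumptions: `suffixes`, `has_suffix`, `n_core` and `out` are illustrative names, "Jr" is the only suffix, and every name has at least a first and a last name.

```python
import pandas as pd

names = pd.Series([
    "Joseph Sam Smith",
    "Angela Savage",
    "James Taylor",
    "William Smith Jr",
])
suffixes = ["Jr"]  # assumed: the only suffix in this dataset

tokens = names.str.split()  # one list of tokens per row
has_suffix = tokens.str[-1].isin(suffixes)
# number of tokens once a trailing suffix is excluded
n_core = tokens.str.len() - has_suffix.astype(int)

out = pd.DataFrame({
    "first_name": tokens.str[0],
    # a middle name exists only when three name tokens remain
    "middle_name": tokens.str[1].where(n_core >= 3),
    # the last name is second from the end when a suffix is present
    "last_name": tokens.str[-2].where(has_suffix, tokens.str[-1]),
    "suffix": tokens.str[-1].where(has_suffix),
})
print(out)
```

Parts that are absent come out as NaN here rather than None or an empty string.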