how can i split a full name to first name, Middle name and last name in python jupyter notebook?-CodePudding

I want to split Full name in three columns such as first name, Middle name and last name for eg :

1   Moni Avhad
2   Kumar Krishnan Menon
3   Divya Sharma
4   Mimi Atul Sharma
5   Murad Oberoi

I used the below code to split the names:

import pandas as pd
df = pd.read_excel("Sample.xlsx")
df[['first Name','middle Name','Last Name']]=df['Full Name'].str.split(' ',n=2,expand=True)
df

but many a time my full name is not constant for example(Kumar Krishnan Menon , Mini Avinash Sharma) have three seperate names First,middle and last name.But in other cases middle name is not present.

Expected output:

    First Name   Middle Name   Last Name
1   Moni            N/A         Avhad
2   Kumar         Krishnan      Menon
3   Divya           N/A         Sharma
4   Mimi           Atul         Sharma
5   Murad           N/A         Oberoi

Also if there are multiple Surname such as Rani Amol Sharma-Warma How do we split that and one more scenario where no middle name is present but two surnames are present eg Disha Sharma Warma

CodePudding user response：

Use your solution without n=2 and space and then replace missing values in Last column by middle values and set Nones:

df[['first Name','middle Name','Last Name']] = df['Full Name'].str.split(expand=True)

m = df['Last Name'].isna()
df['Last Name'] = df['Last Name'].fillna(df['middle Name'])
df.loc[m, 'middle Name'] = None
print (df)
                  Full Name first Name middle Name Last Name
0                Moni Avhad       Moni        None     Avhad
1      Kumar Krishnan Menon      Kumar    Krishnan     Menon
2              Divya Sharma      Divya        None    Sharma
3          Mimi Atul Sharma       Mimi        Atul    Sharma
4              Murad Oberoi      Murad        None    Oberoi

Another idea with splitting to Series called s and set Nones by Series.mask:

s = df['Full Name'].str.split()
df['First_Name'] = s.str[0]
df['Middle_Name'] = s.str[1].mask(s.str.len().ne(3), None)
df['Last_Name'] = s.str[-1]
print (df)
                  Full Name First_Name Middle_Name Last_Name
0                Moni Avhad       Moni        None     Avhad
1      Kumar Krishnan Menon      Kumar    Krishnan     Menon
2              Divya Sharma      Divya        None    Sharma
3          Mimi Atul Sharma       Mimi        Atul    Sharma
4              Murad Oberoi      Murad        None    Oberoi

CodePudding user response：

Not the most elegant way but you could use apply:

df["FirstName"] = df["FullName"].apply(lambda x : x.split()[0])
df["MiddleName"] = df["FullName"].apply(lambda x : " ".join(x.split()[1:-1]) if len(x.split())>2 else None)
df["LastName"] = df["FullName"].apply(lambda x : x.split()[-1])

The benefit of this being that, if a person has more than one middle name, the code would not fail. Also, it is bad habit to leave spaces and/or special characters in the column names of the data frames. It's best to take care of that.

CodePudding user response：

You can use Series.str.extract() method here:

In [93]: df
Out[93]:
              Full Name
0            Moni Avhad
1  Kumar Krishnan Menon
2          Divya Sharma
3      Mimi Atul Sharma
4          Murad Oberoi

In [94]: df[["FirstName", "MiddleName", "LastName"]] = \
    df["Full Name"].str.extract(r"(?P<First_Name>\w )\s*(?P<Middle_Name>\w )?\s (?P<Last_Name>\w )")

In [95]: df
Out[95]:
              Full Name FirstName MiddleName LastName
0            Moni Avhad      Moni        NaN    Avhad
1  Kumar Krishnan Menon     Kumar   Krishnan    Menon
2          Divya Sharma     Divya        NaN   Sharma
3      Mimi Atul Sharma      Mimi       Atul   Sharma
4          Murad Oberoi     Murad        NaN   Oberoi

CodePudding user response：

You can use pd.str.split and fill middle name column only when length of string is 3.

splitted = df['Full Name'].str.split()
df['First_Name'] = splitted.str[0]
df['Last_Name'] = splitted.str[-1]
df['Middle_Name'] = df['Full Name'].loc[splitted.str.len() == 3].str.split(expand=True)[1]

which gives us the expected output :

  First_Name Middle_Name Last_Name
0       Moni         NaN     Avhad
1      Kumar    Krishnan     Menon
2      Divya         NaN    Sharma
3       Mimi        Atul    Sharma
4      Murad         NaN    Oberoi

CodePudding user response：

# Creating a dataframe
df = pd.DataFrame({'FullName':['Moni Avhad',
'Kumar Krishnan Menon',
'Divya Sharma',
'Mimi Atul Sharma',
'Murad Oberoi']})

# Using List Comprehension and checking the len and creating First Name, Mid Name, Last Name
df['FirstName'] = [x.split(' ')[0] for x in df['FullName']]
df['MidName'] = [x.split(' ')[1] if len(x.split(' '))==3 else np.NaN for x in df['FullName']]
df['LastName'] = [x.split(' ')[1] if len(x.split(' '))==2 else x.split(' ')[2] for x in df['FullName']]