I have a column of strings containing full names. Last names are written in all-uppercase letters, while first names are written in proper case. Most names are ordered as (Firstname LASTNAME), but many have the LASTNAME part in the middle or at the beginning of the string, as in the last two entries here.
0 Manuel JOSE
1 Vincent MUANDUMBA
2 Alejandro DE LORRES
3 Luis FILIPE da Rivera
4 LIM Jock Hoi
I would like to split this column into separate FirstName and LastName columns according to whether each word in the string is in proper case (first name) or in all caps (last name).
new = df["FullName"].str.split(pat=r'(?=[A-Z][a-z])', n=1, expand = True)
df['FirstName'] = new[0]
df['LastName'] = new[1]
All words in proper case or lowercase should end up in new[0], while all words in uppercase should end up in new[1].
However, I can't get this output because my regex isn't right. I've also tried pat=r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)'
CodePudding user response:
You can use regex:
df['LastName'] = df['FullName'].str.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b').str.join(' ')
df['FirstName'] = df['FullName'].str.findall(r"[A-Z]{0,1}[a-z]+").str.join(' ')
Output:
                FullName   LastName       FirstName
0            Manuel JOSE       JOSE          Manuel
1      Vincent MUANDUMBA  MUANDUMBA         Vincent
2    Alejandro DE LORRES  DE LORRES       Alejandro
3  Luis FILIPE da Rivera     FILIPE  Luis da Rivera
4           LIM Jock Hoi        LIM        Jock Hoi
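If you want to check the two patterns outside pandas, here is a minimal sketch using only the standard re module. The helper name split_by_regex is made up for illustration, and it assumes last names consist only of ASCII capitals A-Z, so accented or hyphenated surnames would need a wider character class.

```python
import re

# One or more all-caps words in a row, e.g. "DE LORRES" (ASCII letters only)
LAST_RE = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")
# Proper-case or lowercase words, e.g. "Luis", "da"
FIRST_RE = re.compile(r"[A-Z]?[a-z]+")


def split_by_regex(full_name):
    """Return (first_name, last_name) for a single full-name string."""
    last = " ".join(LAST_RE.findall(full_name))
    first = " ".join(FIRST_RE.findall(full_name))
    return first, last


names = [
    "Manuel JOSE",
    "Alejandro DE LORRES",
    "Luis FILIPE da Rivera",
    "LIM Jock Hoi",
]
results = [split_by_regex(name) for name in names]
for name, (first, last) in zip(names, results):
    print(f"{name!r}: first={first!r}, last={last!r}")
```

The same function could be mapped over the column with df['FullName'].map(split_by_regex) if you prefer a single pass over the data.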
CodePudding user response:
This code is a bit longer than a regex pattern, but it guarantees that every word of the name string goes to either the first name or the last name. The trick is to use the str.istitle() method.
# Split every string in the FullName column into a list of words
new = df["FullName"].str.split(' ')
# Create empty lists to hold the new columns for df
FirstName = []
LastName = []
# Iterate over every split name (one list of words per row)
for name in new:
    proper_case = []  # Words for FirstName (proper case or lowercase)
    all_caps = []  # Words for LastName (all caps)
    # Iterate over every word in the name
    for n in name:
        # Check whether the word is proper case or lowercase ('da')
        if n.istitle() or n.islower():
            proper_case.append(n)
        # If not, treat it as all caps
        else:
            all_caps.append(n)
    # Add the proper-case words to the FirstName list
    FirstName.append(' '.join(proper_case))
    # Add the all-caps words to the LastName list
    LastName.append(' '.join(all_caps))
# Create the columns
df['FirstName'] = FirstName
df['LastName'] = LastName
Output:
                FullName       FirstName   LastName
0            Manuel JOSE          Manuel       JOSE
1      Vincent MUANDUMBA         Vincent  MUANDUMBA
2    Alejandro DE LORRES       Alejandro  DE LORRES
3  Luis FILIPE da Rivera  Luis da Rivera     FILIPE
4           LIM Jock Hoi        Jock Hoi        LIM
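The same word-by-word logic can also be packaged as a small function, which makes it easy to try on plain strings before touching the DataFrame. This is only a sketch: split_by_case is a hypothetical name, and it uses split() with no argument so that repeated spaces do not produce empty words (an empty word is neither title case nor lowercase, so it would land in the all-caps branch).

```python
def split_by_case(full_name):
    """Return (first_name, last_name), sorting words by their letter case."""
    first_words = []
    last_words = []
    # split() with no argument ignores repeated or leading/trailing whitespace
    for word in full_name.split():
        if word.istitle() or word.islower():
            first_words.append(word)  # proper case or lowercase ('da')
        else:
            last_words.append(word)  # everything else is treated as all caps
    return " ".join(first_words), " ".join(last_words)


examples = {
    "Luis FILIPE da Rivera": split_by_case("Luis FILIPE da Rivera"),
    "LIM  Jock Hoi": split_by_case("LIM  Jock Hoi"),  # note the double space
    "Jean-Pierre DUPONT": split_by_case("Jean-Pierre DUPONT"),
}
for name, (first, last) in examples.items():
    print(f"{name!r}: first={first!r}, last={last!r}")
```

Hyphenated first names such as "Jean-Pierre" still count as title case for istitle(), so they stay on the first-name side.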
This can be faster if you are sure that the first word of the name is either the complete first name or the complete last name. That holds for names in most cultures, but it is less general:
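# Split only at the first space; n is passed by keyword because
# pandas 2.x no longer accepts it positionally
new = df["FullName"].str.split(' ', n=1)
FirstName = []
LastName = []
for name in new:
    # If the first word is proper case, it is the first name
    if name[0].istitle():
        FirstName.append(name[0])
        LastName.append(name[1])
    # Otherwise the first word is the last name
    else:
        FirstName.append(name[1])
        LastName.append(name[0])
df['FirstName'] = FirstName
df['LastName'] = LastName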
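One thing to watch with this shortcut: a name with no space has no second part, so indexing name[1] would raise an IndexError. A minimal sketch of a guarded variant, assuming you are happy to get an empty string for the missing part (split_on_first_word is a hypothetical helper name):

```python
def split_on_first_word(full_name):
    """Return (first_name, last_name), deciding by the first word only."""
    # partition() always returns three parts, so a single-word name
    # simply yields an empty string for `rest` instead of raising
    first_word, _, rest = full_name.partition(" ")
    if first_word.istitle():
        return first_word, rest
    return rest, first_word


checks = [
    split_on_first_word("Manuel JOSE"),
    split_on_first_word("LIM Jock Hoi"),
    split_on_first_word("Luis FILIPE da Rivera"),
    split_on_first_word("Madonna"),
]
for pair in checks:
    print(pair)
```

The third example shows the trade-off of this approach: everything after the first word is kept together, so "Luis FILIPE da Rivera" gives "FILIPE da Rivera" as the last name rather than separating "da Rivera" by case.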