Hey, I want to split names into columns at the uppercase letters. In most cases two names are combined, in some cases three, so the string has to be split more than once and each name should go into its own new column. With my attempt the first part of the string always goes missing and one of the new columns is empty. I tried several approaches with str.split without success.
data = [['TomPeter', 10], ['NickFrank', 15], ['JuliLizaMary', 18]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
Name Age
0 TomPeter 10
1 NickFrank 15
2 JuliLizaMary 18
df[['5', '6']] = df['Name'].str.split('[A-Z][a-z]*', n=1, expand=True)
           Name  Age 5         6
0      TomPeter   10       Peter
1     NickFrank   15       Frank
2  JuliLizaMary   18    LizaMary
CodePudding user response:
You can use regex capturing groups with the str.extract method:
df['Name'].str.extract('([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
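As a sketch only, not part of the answer above: the two-group pattern drops the third name in a row such as JuliLizaMary. Assuming there are never more than three names, an optional third group could be appended. The column names first, second and third are made up for this illustration.

```python
import pandas as pd

data = [['TomPeter', 10], ['NickFrank', 15], ['JuliLizaMary', 18]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Two required groups plus an optional third one (assumes at most three names).
pattern = r'([A-Z][a-z]*)([A-Z][a-z]*)([A-Z][a-z]*)?'
parts = df['Name'].str.extract(pattern, expand=True)

# Hypothetical column names, chosen for this example only.
parts.columns = ['first', 'second', 'third']
df = df.join(parts)
print(df)
```

Rows with only two names should end up with a missing value in the third column.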
CodePudding user response:
Add a space before the capital letters and split into first/middle/last columns:
df[['FN', 'MN', 'LN']] = df['Name'].replace(r'([A-Z])', r' \1', regex=True).str.split(expand=True)
# output
Name Age FN MN LN
0 TomPeter 10 Tom Peter None
1 NickFrank 15 Nick Frank None
2 JuliLizaMary 18 Juli Liza Mary
Then swap the middle/last columns if there are only 2 names (i.e., middle should be empty):
df.loc[df['LN'].isna(), ['MN', 'LN']] = df.loc[df['LN'].isna(), ['LN', 'MN']].values
# output
Name Age FN MN LN
0 TomPeter 10 Tom None Peter
1 NickFrank 15 Nick None Frank
2 JuliLizaMary 18 Juli Liza Mary
CodePudding user response:
The problem is that you are using the wrong regex for split. Your regex is suitable for finding every word that starts with an uppercase letter, but it should not be used as the split pattern: split treats each matched word as a separator and removes it, so only empty strings are left:
df['Name'].str.split(r'[A-Z][a-z]*')
0 [, , ]
1 [, , ]
2 [, , , ]
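The same behaviour can be reproduced with Python's re module alone. This is a minimal sketch on a single string, assuming pandas applies the pattern the same way re does here:

```python
import re

name = 'JuliLizaMary'

# Splitting ON the names consumes them; only the empty gaps around them remain.
gaps = re.split(r'[A-Z][a-z]*', name)

# Finding the names returns the matched text itself.
names = re.findall(r'[A-Z][a-z]*', name)

print(gaps)   # ['', '', '', '']
print(names)  # ['Juli', 'Liza', 'Mary']
```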
You can try pandas apply on rows:
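import re

def splitUppercase(row):
    # Find every capitalized word in the name
    names = re.findall('[A-Z][a-z]*', row['Name'])
    # First word on its own, remaining words joined back together
    return names[0], ''.join(names[1:])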
df[['5', '6']] = df.apply(splitUppercase, result_type='expand', axis=1)
df
Name Age 5 6
0 TomPeter 10 Tom Peter
1 NickFrank 15 Nick Frank
2 JuliLizaMary 18 Juli LizaMary
To avoid apply, you can write a pattern that captures the first capitalized word in one group and all remaining capitalized words in a second group:
df[['5', '6']] = df['Name'].str.extract('(^[A-Z][a-z]*)([A-Z].*$)', expand=True)
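If every name should get its own column instead of the rest staying joined in column 6, one possible variation (a sketch, not the approach above) is to collect the matches with str.findall and build the columns from the resulting lists. The Name_ prefix for the new columns is a hypothetical choice.

```python
import pandas as pd

data = [['TomPeter', 10], ['NickFrank', 15], ['JuliLizaMary', 18]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# One list of capitalized words per row, e.g. ['Juli', 'Liza', 'Mary'].
found = df['Name'].str.findall(r'[A-Z][a-z]*')

# Building a frame from the lists pads shorter rows with missing values.
parts = pd.DataFrame(found.tolist(), index=df.index)

# Hypothetical column labels: Name_0, Name_1, ...
parts = parts.add_prefix('Name_')

result = df.join(parts)
print(result)
```

This does not need the number of names to be known in advance; the widest row determines how many columns are created.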