Split column at uppercase letters into separate columns

Time:03-19

Hey, I want to split names into separate columns at each uppercase letter. In most cases two names are combined, in some cases three, so I need to split more than once, and each name should go into its own new column. With my attempt the first part of the string is always missing and one new column is empty. I tried several approaches with str.split without success.

import pandas as pd

data = [['TomPeter', 10], ['NickFrank', 15], ['JuliLizaMary', 18]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

           Name  Age
0      TomPeter   10
1     NickFrank   15
2  JuliLizaMary   18

df[['5', '6']] = df['Name'].str.split('[A-Z][a-z]*', n=1, expand=True)

           Name  Age   5     6
0      TomPeter   10       Peter
1     NickFrank   15       Frank
2  JuliLizaMary   18       LizaMary

CodePudding user response:

You can use regex capturing groups with the extract method:

df['Name'].str.extract('([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
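
With only two capturing groups, a third name such as Mary in JuliLizaMary is not captured. A minimal sketch of one way to cover that case, assuming every name is made of two or three capitalized words: add an optional third group and anchor the pattern. The column names first, second and third are illustrative and not from the question.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['TomPeter', 'NickFrank', 'JuliLizaMary']})

# Two mandatory capitalized words plus an optional third one.
# Anchoring with ^ and $ makes the groups cover the whole string.
pattern = r'^([A-Z][a-z]*)([A-Z][a-z]*)([A-Z][a-z]*)?$'

parts = df['Name'].str.extract(pattern, expand=True)
parts.columns = ['first', 'second', 'third']  # illustrative names

# Rows without a third word get a missing value in the 'third' column.
print(parts)
```

The result can be joined back onto the original frame with df.join(parts) if the extra columns are wanted next to Name.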

CodePudding user response:

Add a space before the capital letters and split into first/middle/last columns:

df[['FN', 'MN', 'LN']] = df['Name'].replace(r'([A-Z])', r' \1', regex=True).str.split(expand=True)

# output
           Name  Age    FN     MN    LN
0      TomPeter   10   Tom  Peter  None
1     NickFrank   15  Nick  Frank  None
2  JuliLizaMary   18  Juli   Liza  Mary

Then swap the middle/last columns if there are only 2 names (i.e., middle should be empty):

df.loc[df['LN'].isna(), ['MN', 'LN']] = df.loc[df['LN'].isna(), ['LN', 'MN']].values

# output
           Name  Age    FN    MN     LN
0      TomPeter   10   Tom  None  Peter
1     NickFrank   15  Nick  None  Frank
2  JuliLizaMary   18  Juli  Liza   Mary

CodePudding user response:

The problem is that you are using the wrong regex for split. Your regex is suitable for finding every word that starts with an uppercase letter, but you shouldn't use it as the split pattern: split treats each matched word as a delimiter and removes it, so only empty strings are returned:

df['Name'].str.split(r'[A-Z][a-z]*')

0      [, , ]
1      [, , ]
2    [, , , ]
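
The same behaviour can be reproduced with the standard re module, which may make it easier to see. The sketch below contrasts the consuming pattern with a zero-width lookahead that splits before each uppercase letter without removing anything. The lookahead variant is an alternative not mentioned in the question, and it assumes Python 3.7 or later, where re.split accepts empty matches.

```python
import re

names = ['TomPeter', 'NickFrank', 'JuliLizaMary']

# The pattern matches whole words, so every word is treated as a
# delimiter and discarded; only the empty gaps between them remain.
consumed = [re.split(r'[A-Z][a-z]*', n) for n in names]

# A zero-width lookahead matches the position before an uppercase letter
# and consumes no characters; (?<!^) skips the position at the very start.
kept = [re.split(r'(?<!^)(?=[A-Z])', n) for n in names]

print(consumed)
print(kept)
```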

You can try pandas apply on rows:

import re

def splitUppercase(row):
    # Collect every capitalized word, keep the first one and rejoin the rest.
    names = re.findall(r'[A-Z][a-z]*', row['Name'])
    return names[0], ''.join(names[1:])

df[['5', '6']] = df.apply(splitUppercase, result_type='expand', axis=1)
df

           Name  Age     5         6
0      TomPeter   10   Tom     Peter
1     NickFrank   15  Nick     Frank
2  JuliLizaMary   18  Juli  LizaMary

To avoid apply, you can design a pattern that extracts the first capitalized word in one group and the remaining capitalized words in a second group, like the following:

df[['5', '6']] = df['Name'].str.extract('(^[A-Z][a-z]*)([A-Z].*$)', expand=True)
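
To see what the two groups capture, here is a small sketch that applies the same pattern with the standard re module. The helper first_and_rest and the single-word input 'Tom' are illustrative additions, not part of the question's data. A name with no second capitalized word does not match at all, which in str.extract shows up as missing values in both columns.

```python
import re

# Same pattern as in the str.extract call above.
pattern = re.compile(r'(^[A-Z][a-z]*)([A-Z].*$)')

def first_and_rest(name):
    """Return (first word, remainder), or None if there is no second word."""
    match = pattern.search(name)
    return match.groups() if match else None

# 'Tom' is an extra edge case added for illustration.
results = {n: first_and_rest(n) for n in ['TomPeter', 'JuliLizaMary', 'Tom']}
print(results)
```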