Home > Software engineering >  pandas with regex for a calculated column
pandas with regex for a calculated column

Time:10-13

I start with a regex to extract matches of strings without numbers with length of 2 or more and join them to make a new field

import re
# testing regex here
part = '24C-435-SA'
part1 = '24CE-435-SA'
group = "-".join(re.findall(r'[A-za-z]{2,}', part))
group1 = "-".join(re.findall(r'[A-za-z]{2,}', part1))
print("Group:", group, "Group1:", group1)

and result is Group: SA Group1: CE-SA

All good there, now I'm trying to apply the same regex on a pandas dataframe df and running into problems as below

        part    unit_price
0   24C-435-SA  10
1   30C-435-SB  20
2   36C-435-SC  30
3   42C-435-SD  40

Now I try the new column:

df['group'] = "-".join(re.findall(r'[A-za-z]{2,}', str(df['part'])))

or even different syntax to avoid some pandas warnings:

df.loc[:, 'group'] = "-".join(re.findall(r'[A-za-z]{2,}', str(df['part'])))

and I get this:

        part    unit_price                group
0   24C-435-SA  10          SA-SB-SC-SD-Name-part-dtype-object
1   30C-435-SB  20          SA-SB-SC-SD-Name-part-dtype-object
2   36C-435-SC  30          SA-SB-SC-SD-Name-part-dtype-object
3   42C-435-SD  40          SA-SB-SC-SD-Name-part-dtype-object

I see it keeps joining all rows matches together instead of expected values:

    group
    SA
    SB
    SC
    SD

CodePudding user response:

str(df['part']) returns the string representation of the series, not really what you want.

You can use apply:

df['group'] = df['part'].apply(lambda x: "-".join(re.findall(r'[A-za-z]{2,}', x)))

However, you can do most regex operations with .str. In your case, try:

df['group'] = (df['part'].str.extractall('([A-Za-z]{2,})')
                 .groupby(level=0)[0].agg('-'.join)
              )

CodePudding user response:

You can try with Series.str.findall and Series.str.join:

>>> df['group'] = df['part'].str.findall(r'[A-za-z]{2,}').str.join('-')
  • Related