I start with a regex to extract matches of strings without numbers with length of 2 or more and join them to make a new field
import re
# testing regex here
part = '24C-435-SA'
part1 = '24CE-435-SA'
group = "-".join(re.findall(r'[A-za-z]{2,}', part))
group1 = "-".join(re.findall(r'[A-za-z]{2,}', part1))
print("Group:", group, "Group1:", group1)
and result is Group: SA Group1: CE-SA
All good there, now I'm trying to apply the same regex on a pandas dataframe df and running into problems as below
part unit_price
0 24C-435-SA 10
1 30C-435-SB 20
2 36C-435-SC 30
3 42C-435-SD 40
Now I try the new column:
df['group'] = "-".join(re.findall(r'[A-za-z]{2,}', str(df['part'])))
or even different syntax to avoid some pandas warnings:
df.loc[:, 'group'] = "-".join(re.findall(r'[A-za-z]{2,}', str(df['part'])))
and I get this:
part unit_price group
0 24C-435-SA 10 SA-SB-SC-SD-Name-part-dtype-object
1 30C-435-SB 20 SA-SB-SC-SD-Name-part-dtype-object
2 36C-435-SC 30 SA-SB-SC-SD-Name-part-dtype-object
3 42C-435-SD 40 SA-SB-SC-SD-Name-part-dtype-object
I see it keeps joining all rows matches together instead of expected values:
group
SA
SB
SC
SD
CodePudding user response:
str(df['part'])
returns the string representation of the series, not really what you want.
You can use apply
:
df['group'] = df['part'].apply(lambda x: "-".join(re.findall(r'[A-za-z]{2,}', x)))
However, you can do most regex operations with .str
. In your case, try:
df['group'] = (df['part'].str.extractall('([A-Za-z]{2,})')
.groupby(level=0)[0].agg('-'.join)
)
CodePudding user response:
You can try with Series.str.findall
and Series.str.join
:
>>> df['group'] = df['part'].str.findall(r'[A-za-z]{2,}').str.join('-')