I have my input data like as below stored in a dataframe column
active_days_revenue
active_days_rate
total_revenue
gap_days_rate
I would like to do the below
a) split the string using _
delimiter
b) extract n
elements from the delimiter
So, I tried the below
df['text'].split('_')[:1] # but this doesn't work
df['text'].split('_')[0] # this works but returns only the 1st element
I expect my output like below. Instead of just getting items based on 0 index position
, I would like to get from 0 to 1st index position
active_days
active_days
total_revenue
gap_days
CodePudding user response:
You can use str.extract
with a dynamic regex (fastest):
N = 2
df['out'] = df['text'].str.extract(fr'([^_] (?:_[^_] ){{,{N-1}}})', expand=False)
Or slicing and agg
:
df['out'] = df['text'].str.split('_').str[:2].agg('_'.join)
Or str.extractall
and groupby.agg
:
df['out'] = df['text'].str.extractall('([^_] )')[0].groupby(level=0).agg(lambda x: '_'.join(x.head(2)))
Output:
text out
0 active_days_revenue active_days
1 active_days_rate active_days
2 total_revenue total_revenue
3 gap_days_rate gap_days
timings
On 4k rows:
# extract
2.17 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# split/slice/agg
3.56 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# extractall
361 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
using as grouper for columns
import re
N = 2
df1.groupby(lambda x: m.group() if (m:=re.match(fr'([^_] (?:_[^_] ){{,{N-1}}})', x)) else x, axis=1, sort=False)