pandas split string and extract upto n index position-CodePudding

I have my input data like as below stored in a dataframe column

active_days_revenue
active_days_rate
total_revenue
gap_days_rate

I would like to do the below

a) split the string using _ delimiter

b) extract n elements from the delimiter

So, I tried the below

df['text'].split('_')[:1]  # but this doesn't work
df['text'].split('_')[0] # this works but returns only the 1st element

I expect my output like below. Instead of just getting items based on 0 index position, I would like to get from 0 to 1st index position

active_days
active_days
total_revenue
gap_days

CodePudding user response：

You can use str.extract with a dynamic regex (fastest):

N = 2
df['out'] = df['text'].str.extract(fr'([^_] (?:_[^_] ){{,{N-1}}})', expand=False)

Or slicing and agg:

df['out'] = df['text'].str.split('_').str[:2].agg('_'.join)

Or str.extractall and groupby.agg:

df['out'] = df['text'].str.extractall('([^_] )')[0].groupby(level=0).agg(lambda x: '_'.join(x.head(2)))

Output:

                  text            out
0  active_days_revenue    active_days
1     active_days_rate    active_days
2        total_revenue  total_revenue
3        gap_days_rate       gap_days

timings

On 4k rows:

# extract
2.17 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# split/slice/agg
3.56 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# extractall
361 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

using as grouper for columns

import re
N = 2
df1.groupby(lambda x: m.group() if (m:=re.match(fr'([^_] (?:_[^_] ){{,{N-1}}})', x)) else x, axis=1, sort=False)