So I have a DataFrame that looks something along these lines:
import pandas as pd
ddd = {
'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5]
}
df = pd.DataFrame(ddd)
What I need is to group the data by the 'c'
column and apply some data transformations. At the moment I'm doing this:
def do_stuff(d: pd.DataFrame):
if d.shape[0] >= 2:
return pd.DataFrame(
{
'start': [d.a.values[0]],
'end': [d.a.values[d.shape[0] - 1]],
'foo': [d.a.sum()],
'bar': [d.b.mean()]
}
)
else:
return pd.DataFrame()
r = df.groupby('c').apply(lambda x: do_stuff(x))
Which gives the correct result:
start end foo bar
c
1 0 1.0 3.0 6.0 21.666667
2 0 4.0 5.0 9.0 36.000000
4 0 7.0 9.0 24.0 45.000000
The problem is that this approach appears to be too slow. On my actual data it runs in around 0.7 seconds which is too long and needs to be ideally much faster.
Is there any way I can do this faster? Or maybe there's some other faster method not involving groupby
that I could use?
CodePudding user response:
We could first filter df
for the "c" values that appear 2 or more times; then use groupby
named aggregation:
msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
.groupby('c')
.agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
You could also do:
out = (df[df.groupby('c')['c'].transform('count').ge(2)]
.groupby('c')
.agg(start=('a','first'),
end=('a','last'),
foo=('a','sum'),
bar=('b','mean')))
or
msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
.groupby('c')
.agg({'a':['first','last','sum'], 'b':'mean'})
.set_axis(['start','end','foo','bar'], axis=1))
Output:
start end foo bar
c
1 1 3 6 21.666667
2 4 5 9 36.000000
4 7 9 24 45.000000
Some benchmarks:
>>> %timeit out = df.groupby('c').apply(lambda x: do_stuff(x))
6.49 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.6 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit out = (df[df.groupby('c')['c'].transform('count').ge(2)].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.86 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg({'a':['first','last','sum'], 'b':'mean'}).set_axis(['start','end','foo','bar'], axis=1))
4.68 ms ± 57.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)