Faster data reformatting with groupby


So I have a DataFrame that looks something along these lines:

import pandas as pd

ddd = {
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
    'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5]
}
df = pd.DataFrame(ddd)

What I need is to group the data by the 'c' column and apply some data transformations. At the moment I'm doing this:

def do_stuff(d: pd.DataFrame):
    if d.shape[0] >= 2:
        return pd.DataFrame(
            {
                'start': [d.a.values[0]],
                'end': [d.a.values[-1]],
                'foo': [d.a.sum()],
                'bar': [d.b.mean()]
            }
        )
    else:
        # groups with fewer than 2 rows are dropped
        return pd.DataFrame()

r = df.groupby('c').apply(do_stuff)

Which gives the correct result:

     start  end   foo        bar
c                               
1 0    1.0  3.0   6.0  21.666667
2 0    4.0  5.0   9.0  36.000000
4 0    7.0  9.0  24.0  45.000000

The problem is that this approach is too slow. On my actual data it takes around 0.7 seconds, which is too long; ideally it needs to be much faster.

Is there any way I can do this faster? Or maybe there's some other faster method not involving groupby that I could use?

CodePudding user response:

We can first filter df for the "c" values that appear two or more times, then use groupby with named aggregation:

msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
       .groupby('c')
       .agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))

You could also do:

out = (df[df.groupby('c')['c'].transform('count').ge(2)]
       .groupby('c')
       .agg(start=('a','first'), 
            end=('a','last'), 
            foo=('a','sum'), 
            bar=('b','mean')))

or

msk = df['c'].value_counts() >= 2
out = (df[df['c'].isin(msk.index[msk])]
       .groupby('c')
       .agg({'a':['first','last','sum'], 'b':'mean'})
       .set_axis(['start','end','foo','bar'], axis=1))

Output:

   start  end  foo        bar
c                            
1      1    3    6  21.666667
2      4    5    9  36.000000
4      7    9   24  45.000000

Some benchmarks:

>>> %timeit out = df.groupby('c').apply(lambda x: do_stuff(x))
6.49 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.6 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit out = (df[df.groupby('c')['c'].transform('count').ge(2)].groupby('c').agg(start=('a','first'), end=('a','last'), foo=('a','sum'), bar=('b','mean')))
7.86 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit msk = df['c'].value_counts() >= 2; out = (df[df['c'].isin(msk.index[msk])].groupby('c').agg({'a':['first','last','sum'], 'b':'mean'}).set_axis(['start','end','foo','bar'], axis=1))
4.68 ms ± 57.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
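Since the question also asks about methods not involving groupby: if the data is already sorted by 'c' (as in the example), a NumPy-only sketch using np.unique and np.add.reduceat may be worth benchmarking too. This is an assumption-laden alternative, not part of the accepted approach; it relies on each 'c' group forming a contiguous block:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'b': [22, 25, 18, 53, 19, 8, 75, 11, 49, 64],
    'c': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
})

# Assumes df is sorted by 'c' so every group is contiguous.
a = df['a'].to_numpy()
b = df['b'].to_numpy()
c = df['c'].to_numpy()

# One pass: group keys, offset of each group's first row, group sizes.
keys, starts, counts = np.unique(c, return_index=True, return_counts=True)
keep = counts >= 2                # same >= 2-row filter as the question
ends = starts + counts - 1        # offset of each group's last row

foo = np.add.reduceat(a, starts)          # per-group sum of 'a'
bar = np.add.reduceat(b, starts) / counts # per-group mean of 'b'

out = pd.DataFrame(
    {
        'start': a[starts][keep],
        'end': a[ends][keep],
        'foo': foo[keep],
        'bar': bar[keep],
    },
    index=pd.Index(keys[keep], name='c'),
)
```

Whether this beats the groupby versions depends on the data; for unsorted input, the extra sort can eat the savings.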