Why is `groupby` with `as_index=False` even slower than `groupby` with `reset

I just recently came across this strange Pandas behavior with groupby.

I have this dataframe:

>>> df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 3], 'b': [4, 5, 6, 7, 8, 9]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6
3  1  7
4  2  8
5  3  9
>>>

And I want to groupby the column a and sum the column b.

Normal groupby would have the column as index, but as_index=False won't:

>>> df.groupby('a')['b'].sum()
a
1    11
2    13
3    15
Name: b, dtype: int64
>>>

But when I time them:

>>> timeit(lambda: df.groupby('a')['b'].sum(), number=1000)
0.5426476000000093
>>> timeit(lambda: df.groupby('a')['b'].sum(), number=10000)
4.912795499999902
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=1000)
1.419923899999958
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=10000)
11.907147600000144
>>>

You can see that for some reason, groupby with as_index=False is 2.75x slower!

Not just that! It's even slower than reset_index!

>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=1000)
1.419923899999958
>>> timeit(lambda: df.groupby('a', as_index=False)['b'].sum(), number=10000)
11.907147600000144
>>> timeit(lambda: df.groupby('a')['b'].sum().reset_index(), number=1000)
1.0641113000001496
>>> timeit(lambda: df.groupby('a')['b'].sum().reset_index(), number=10000)
10.01520289999985
>>>

And reset_index obviously also gives the same output as as_index=False:

>>> df.groupby('a')['b'].sum().reset_index()
   a   b
0  1  11
1  2  13
2  3  15
>>>

as_index=False:

>>> df.groupby('a', as_index=False)['b'].sum()
   a   b
0  1  11
1  2  13
2  3  15
>>>

I can understand that as_index=False might be slower, but not this much slower... Also the main thing is that I can't wrap my head around that why is reset_index faster? That's an extra function...

Why is this? What is the implementation of as_index?

I am really surprised, I even thought that it's very possible that as_index=False would be faster than as_index=True, since it doesn't need a column to be set as the index.

But it's the opposite, it's actually as_index=True being 2.75 times faster... And even reset_index being faster than as_index=False.

If this is the case why doesn't as_index=False also just simply use reset_index?

CodePudding user response：

as_index=True is the default as the grouper uses internally an index and resets it if as_index is set toFalse:

cf. core.groupby.py source

        if not self.as_index:
            self._insert_inaxis_grouper_inplace(result)

The time difference between True/False is actually minimal, you should use a larger dataframe to test the speed.

Here on 600k rows:

True: 30.7 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

False: 32.5 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each

The difference is thus not proportional (2 times slower), but rather fixed (2.5 ms slower), which is less a burden.

Now as to why the source doesn't use reset_index, well pandas does a lot of things internally, I am only guessing here as the code is complex, but there are likely many checks in place that do more things than just resetting the index.

CodePudding user response：

Assign to @mozway's answer, the _insert_inaxis_grouper_inplace function is:

def _insert_inaxis_grouper_inplace(self, result: DataFrame) -> None:
    # zip in reverse so we can always insert at loc 0
    columns = result.columns
    for name, lev, in_axis in zip(
        reversed(self.grouper.names),
        reversed(self.grouper.get_group_levels()),
        reversed([grp.in_axis for grp in self.grouper.groupings]),
    ):
        # GH #28549
        # When using .apply(-), name will be in columns already
        if in_axis and name not in columns:
            result.insert(0, name, lev)

As you can see, it's using insert, not reset index.

So I suspect if it uses:

def _insert_inaxis_grouper_inplace(self, result: DataFrame) -> None:
    # zip in reverse so we can always insert at loc 0
    columns = result.columns
    for name, lev, in_axis in zip(
        reversed(self.grouper.names),
        reversed(self.grouper.get_group_levels()),
        reversed([grp.in_axis for grp in self.grouper.groupings]),
    ):
        # GH #28549
        # When using .apply(-), name will be in columns already
        if in_axis and name not in columns:
            result = result.reset_index()

Not completely sure about the setup though, but I expect the above to work...

Then it probably would be faster.