How can I speed up a pandas groupby that is performing a sum on more than one column?

Ask

I would like to speed up a pandas groupby that sums two columns per group and returns the resulting dataframe.

Code

df = (
    df.groupby(
        ['key', 'code', 'name', 'period', 'agg_metric'],
        sort=False, observed=True, dropna=False,
    )[['metricA', 'metricB']]
    .sum()
    .reset_index()
)

(The method currently takes 2 minutes to process the data for my largest use case.)

Data

The largest dataframe has about 1.5 million rows, over which the groupby is applied. period and agg_metric can be inferred from each other: there are only 2 period values (and thus 2 agg_metric values). The name value can likewise be inferred from code.

After the groupby, I'm left with 700k records. If I understand correctly, the slowdown is due to the number of resulting groups that have to be processed. Is there a way to vectorize this method and apply the summation to all groups at once, rather than (as I assume is happening) iterating over them?

Notes

I have tried using groupby().agg({...}) and groupby().apply(lambda), and both take about the same amount of time. I've also tried removing some of the groupby columns and adding them back afterwards, but it did not speed up the calculation, so it wasn't worth taking them out of the groupby. The snippet already uses sort=False and observed=True, but both did little to improve processing time.
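The agg variant I tried looked roughly like this (equivalent to the snippet above):

def aggregate_with_agg(df):
    # Same grouping keys; agg with an explicit dict instead of .sum()
    return (
        df.groupby(
            ['key', 'code', 'name', 'period', 'agg_metric'],
            sort=False, observed=True, dropna=False,
        )
        .agg({'metricA': 'sum', 'metricB': 'sum'})
        .reset_index()
    )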

I've gone through as many resources as I can (especially this great reference: General Groupby in Python Pandas: Fast way). I'm fairly new to vectorization, and am doing this as I'm off-loading several queries from our DB.

CodePudding user response:

What type of data do you have? It looks like the columns metricA / metricB are of type object, and pandas falls back to slow summation over Python objects rather than fast summation over numpy arrays. Try converting the metric columns to float64 or an integer type.

You can inspect the data types with the df.info() method.
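For example, something along these lines (a minimal sketch, assuming the object columns hold numeric values or numeric-looking strings):

import pandas as pd

# Show column dtypes; 'object' for metricA / metricB is the red flag
df.info()

# Coerce the metric columns to fast numeric dtypes
df["metricA"] = pd.to_numeric(df["metricA"])
df["metricB"] = pd.to_numeric(df["metricB"])

# Or, if the values are already Python floats stored in object columns:
# df = df.astype({"metricA": "float64", "metricB": "float64"})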

Proof:

from string import ascii_letters
from time import time

import numpy as np
import pandas as pd
from numpy.random import choice

N = 1_500_000
np.random.seed(123)

# Pool of 30 random 5-letter words to serve as group keys
letters = list(ascii_letters)
words = ["".join(choice(letters, 5)) for _ in range(30)]

# 1.5M rows: five grouping columns and two float64 metric columns
df = pd.DataFrame(
    {
        "key": choice(words, N),
        "code": choice(words, N),
        "name": choice(words, N),
        "period": np.random.randint(0, 10, N),
        "agg_metric": choice(["mean", "sum", "count"], N),
        "metricA": np.random.rand(N),
        "metricB": np.random.rand(N),
    }
)

# The original groupby-sum, wrapped so it can be timed on both dtypes
def aggregate(df):
    return (
        df.groupby(
            ["key", "code", "name", "period", "agg_metric"],
            sort=False,
            observed=True,
            dropna=False,
        )[["metricA", "metricB"]]
        .sum()
        .reset_index()
    )

# Fast path: metric columns are float64
start = time()
df2 = aggregate(df)
print(f"sum floats took {time() - start}")

# Slow path: identical data with the metric columns cast to object
start = time()
df3 = aggregate(df.astype({"metricA": object, "metricB": object}))
print(f"sum objects took {time() - start}")

assert df2.equals(df3)

Output:

sum floats took 0.2983248233795166
sum objects took 81.04267287254333