Pandas groupby perform computation that uses multiple rows and columns per group-CodePudding

Consider the following data:

,       Animal,     Color,      Rank,       X
0,         c,         b,         1,         9
1,         c,         b,         2,         8
2,         c,         b,         3,         7
3,         c,         r,         1,         6
4,         c,         r,         2,         5
5,         c,         r,         3,         4
6,         d,         g,         1,         3
7,         d,         g,         2,         2
8,         d,         g,         3,         1

I now want to group by ["Animal", "Color"] and, for every group, I want to subtract the X value that corresponds to Rank equal to 1 from every other X value in that group.

Currently I am looping like this:

dfs = []
for _, tmp in df.groupby(["Animal","Color"]):
    baseline = tmp.loc[tmp["Rank"]==1,"X"].to_numpy()
    tmp["Y"] = tmp["X"]-baseline 
    dfs.append(tmp)


dfs = pd.concat(dfs)

This yields the right result, i.e.,

The whole process is however really slow and I would prefer to use apply or transform instead. My problem is that I am unable to find a way to use the whole grouped data within apply or transform.

Is there a way to accelerate my computation?

For completeness, here's my MWE:

df = pd.DataFrame(
    {
        "Animal": {0: "c", 1: "c", 2: "c", 3: "c", 4: "c", 5: "c", 6: "d", 7: "d", 8: "d"},
        "Color": {0: "b", 1: "b", 2: "b", 3: "r", 4: "r", 5: "r", 6: "g", 7: "g", 8: "g"},
        "Rank": {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3, 6: 1, 7: 2, 8: 3},
        "X": {0: 9, 1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1},
    }
)

CodePudding user response：

Maybe it's the same as OP's solution performance-wise, but a little bit shorter:

# Just to be sure that we won't mess up the ordering after groupby
df.sort_values(['Animal', 'Color', 'Rank'], inplace=True)
df['Y'] = df['X'] - df.groupby(['Animal', 'Color']).transform('first')['X']

CodePudding user response：

I think I found a solution that is faster (at least in my use case):

# get a unique identifier for every group
df["_group"] = df.groupby(["Animal", "Color"]).ngroup()

# for every group, get that identifier and the value X to be subtracted
baseline = df.loc[df["Rank"] == 1, ["_group", "X"]]

# merge the original data and the baseline data on the group
# this gives a new column with the Rank==1 value of X
df = pd.merge(df, baseline, on="_group", suffixes=("", "_baseline"))

# perform arithmetic
df["Y"] = df["X"] - df["X_baseline"]

# drop intermediate columns
df.drop(columns=["_group", "X_baseline"], inplace=True)