Home > Blockchain >  Pandas: Sum down a column based on other columns
Pandas: Sum down a column based on other columns

Time:10-22

I've managed to successfully confuse myself with this problem. I have the following dataframe:

    Model   Rank    Prediction  Runtime
0   0.05    1   0.516267    250.500
1   0.05    2   0.504968    253.875
2   0.05    3   0.482915    310.875
3   0.05    4   0.470865    251.375
4   0.05    5   0.459580    277.250
... ... ... ... ...
995 0.50    96  0.130696    250.500
996 0.50    97  0.130696    220.375
997 0.50    98  0.130696    314.625
998 0.50    99  0.130696    232.000
999 0.50    100 0.130696    258.000

And my use case is as follows:

I would, for each Model, like to calculate the total Runtime with respect to its Rank. By that I mean, the Runtime at Rank 1 should be the sum of all Runtimes (for its respective Model) and the Runtime at Rank 100 should be only the Runtime for Rank 100 (for its respective Model).

So for instance,

  • If the Rank is 1, the Runtime column at that row should represent the total sum of all Runtimes for Model 0.05
  • If the Rank is 2, it should be all of the Runtimes for Model 0.05 minus the Runtime for Model 0.05 at Rank 1 ...
  • If the Rank is 100, it should be only the Runtime for Model 0.05 at Rank 100.

I have the idea in my head but I'm not sure how this is achieved in Pandas. I know how to sum the column, but not to sum based on a condition like this. If any more data or explanation is required, I'd be happy to attach it.

CodePudding user response:

If I understand correctly, what you're asking for is essentially a reversed cumulative sum, which you can do by a reverse, cumsum, reverse operation:

In [4]: df["model_runtimes"] = df[::-1].groupby("Model")["Runtime"].cumsum()[::-1]

In [5]: df
Out[5]:
   Model  Rank  Prediction  Runtime  model_runtimes
0   0.05     1    0.516267  250.500        1343.875
1   0.05     2    0.504968  253.875        1093.375
2   0.05     3    0.482915  310.875         839.500
3   0.05     4    0.470865  251.375         528.625
4   0.05     5    0.459580  277.250         277.250
5   0.50    96    0.130696  250.500        1275.500
6   0.50    97    0.130696  220.375        1025.000
7   0.50    98    0.130696  314.625         804.625
8   0.50    99    0.130696  232.000         490.000
9   0.50   100    0.130696  258.000         258.000

CodePudding user response:

I would frame your problem in two steps.

First, each model is independent, so you can split the dataframe by the model field, and solve each one independently. This is a good place to use groupby().

Second, your problem is like a cumulative sum, except that a normal cumulative sum starts at the top and carries the sum down, and you want to do the opposite. You can solve this by reversing the dataframe, or by sorting in descending order.

With that in mind, here's how I would approach this problem. (Lines 1-14 are just setting up the dataset.)

import pandas as pd
import numpy as np

np.random.seed(42)
# Only used for setting up dataframe. Ignore.
def flatten(t):
    return [item for sublist in t for item in sublist]

df = pd.DataFrame({
    "Model": flatten([list(c * 20) for c in "ABCDE"]),
    "Rank": flatten([range(1, 21) for i in range(5)]),
    "Prediction": np.random.rand(100),
    "Runtime": np.random.rand(100),
})



def add_sum(d):
    d["Runtime"] = d["Runtime"].cumsum()
    return d

df = df.sort_values(by=["Model", "Rank"], ascending=False) \
    .groupby("Model") \
    .apply(add_sum) \
    .sort_index()

print(df)
  • Related