I've managed to successfully confuse myself with this problem. I have the following dataframe:
Model Rank Prediction Runtime
0 0.05 1 0.516267 250.500
1 0.05 2 0.504968 253.875
2 0.05 3 0.482915 310.875
3 0.05 4 0.470865 251.375
4 0.05 5 0.459580 277.250
... ... ... ... ...
995 0.50 96 0.130696 250.500
996 0.50 97 0.130696 220.375
997 0.50 98 0.130696 314.625
998 0.50 99 0.130696 232.000
999 0.50 100 0.130696 258.000
And my use case is as follows:
I would, for each Model
, like to calculate the total Runtime
with respect to its Rank
. By that I mean, the Runtime
at Rank
1 should be the sum of all Runtime
s (for its respective Model
) and the Runtime
at Rank
100 should be only the Runtime
for Rank
100 (for its respective Model
).
So for instance,
- If the
Rank
is1
, theRuntime
column at that row should represent the total sum of allRuntime
s forModel
0.05 - If the
Rank
is 2, it should be all of theRuntime
s forModel
0.05 minus theRuntime
forModel
0.05 atRank
1 ... - If the
Rank
is 100, it should be only theRuntime
forModel
0.05 atRank
100.
I have the idea in my head but I'm not sure how this is achieved in Pandas. I know how to sum the column, but not to sum based on a condition like this. If any more data or explanation is required, I'd be happy to attach it.
CodePudding user response:
If I understand correctly, what you're asking for is essentially a reversed cumulative sum, which you can do by a reverse, cumsum, reverse operation:
In [4]: df["model_runtimes"] = df[::-1].groupby("Model")["Runtime"].cumsum()[::-1]
In [5]: df
Out[5]:
Model Rank Prediction Runtime model_runtimes
0 0.05 1 0.516267 250.500 1343.875
1 0.05 2 0.504968 253.875 1093.375
2 0.05 3 0.482915 310.875 839.500
3 0.05 4 0.470865 251.375 528.625
4 0.05 5 0.459580 277.250 277.250
5 0.50 96 0.130696 250.500 1275.500
6 0.50 97 0.130696 220.375 1025.000
7 0.50 98 0.130696 314.625 804.625
8 0.50 99 0.130696 232.000 490.000
9 0.50 100 0.130696 258.000 258.000
CodePudding user response:
I would frame your problem in two steps.
First, each model is independent, so you can split the dataframe by the model field, and solve each one independently. This is a good place to use groupby()
.
Second, your problem is like a cumulative sum, except that a normal cumulative sum starts at the top and carries the sum down, and you want to do the opposite. You can solve this by reversing the dataframe, or by sorting in descending order.
With that in mind, here's how I would approach this problem. (Lines 1-14 are just setting up the dataset.)
import pandas as pd
import numpy as np
np.random.seed(42)
# Only used for setting up dataframe. Ignore.
def flatten(t):
return [item for sublist in t for item in sublist]
df = pd.DataFrame({
"Model": flatten([list(c * 20) for c in "ABCDE"]),
"Rank": flatten([range(1, 21) for i in range(5)]),
"Prediction": np.random.rand(100),
"Runtime": np.random.rand(100),
})
def add_sum(d):
d["Runtime"] = d["Runtime"].cumsum()
return d
df = df.sort_values(by=["Model", "Rank"], ascending=False) \
.groupby("Model") \
.apply(add_sum) \
.sort_index()
print(df)