I have timestamped data in a dataframe of the form:
---- ------- -------
| ID | DATE | VALUE |
---- ------- -------
| 1 | 01-01 | 10 |
| 1 | 01-01 | 20 |
| 1 | 02-01 | 20 |
| 1 | 02-01 | 25 |
| 1 | 03-01 | 30 |
| 2 | 01-01 | 10 |
| 2 | 02-01 | 20 |
| 2 | 02-01 | 30 |
| 2 | 03-01 | 30 |
---- ------- -------
Within each ID group, I would now like to compute the difference between each row's value and the average of the values in all rows of that group with an equal or smaller date, such that:
---- ------- ------- --------------------------------
| ID | DATE | VALUE | OUTPUT |
---- ------- ------- --------------------------------
| 1 | 01-01 | 10 | 10-AVERAGE(10, 20) |
| 1 | 01-01 | 20 | 20-AVERAGE(10, 20) |
| 1 | 02-01 | 20 | 20-AVERAGE(10, 20, 20, 25) |
| 1 | 02-01 | 25 | 25-AVERAGE(10, 20, 20, 25) |
| 1 | 03-01 | 30 | 30-AVERAGE(10, 20, 20, 25, 30) |
| 2 | 01-01 | 10 | 10-AVERAGE(10) |
| 2 | 02-01 | 20 | 20-AVERAGE(10, 20, 30) |
| 2 | 02-01 | 30 | 30-AVERAGE(10, 20, 30) |
| 2 | 03-01 | 30 | 30-AVERAGE(10, 20, 30, 30) |
---- ------- ------- --------------------------------
How can I achieve this with Pandas/NumPy without creating a host of temporary columns?
I tried boolean masking columns and a temporary column for every DATE (each only partially filled), but this is rather cumbersome.
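For reference, the sample frame above can be reproduced as follows (dates kept as plain "MM-DD" strings, which happen to sort chronologically here):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["01-01", "01-01", "02-01", "02-01", "03-01",
             "01-01", "02-01", "02-01", "03-01"],
    "VALUE": [10, 20, 20, 25, 30, 10, 20, 30, 30],
})
```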
CodePudding user response:
You could write a custom function that computes the expanding average and then call it with groupby.apply:
def expanding_average(frame):
    # Sum and count of VALUE per DATE, accumulated over the dates so far.
    sums = frame.groupby("DATE")["VALUE"].sum().expanding(1).sum()
    counts = frame.groupby("DATE")["VALUE"].count().expanding(1).sum()
    average = sums.div(counts)
    # Map each row's DATE to the running average up to and including that date.
    return frame["DATE"].map(average)

df["Average"] = df["VALUE"].sub(df.groupby("ID").apply(expanding_average).droplevel(0))
>>> df
ID DATE VALUE Average
0 1 01-01 10 -5.00
1 1 01-01 20 5.00
2 1 02-01 20 1.25
3 1 02-01 25 6.25
4 1 03-01 30 9.00
5 2 01-01 10 0.00
6 2 02-01 20 0.00
7 2 02-01 30 10.00
8 2 03-01 30 7.50
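An alternative that avoids groupby.apply is a sketch along these lines (names such as totals and running_avg are illustrative, and it assumes DATE sorts chronologically as a string): aggregate sum and count per (ID, DATE) pair, accumulate them within each ID, and join the resulting running average back onto the original rows.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["01-01", "01-01", "02-01", "02-01", "03-01",
             "01-01", "02-01", "02-01", "03-01"],
    "VALUE": [10, 20, 20, 25, 30, 10, 20, 30, 30],
})

# Per-(ID, DATE) totals, then accumulate them within each ID.
totals = df.groupby(["ID", "DATE"])["VALUE"].agg(["sum", "count"])
cum = totals.groupby(level="ID").cumsum()
running_avg = (cum["sum"] / cum["count"]).rename("running_avg")

# Join the running average back on (ID, DATE) and take the difference.
out = df.join(running_avg, on=["ID", "DATE"])
out["Average"] = out["VALUE"] - out["running_avg"]
```

Both approaches give the same result; this one keeps everything vectorized at the cost of one intermediate aggregation frame.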