In each new row of dataframe, I need to keep track of min and max values for a previous group of records.
- Create dataframe with input data:
import pandas as pd
columns = ['timestamp','groupid','value']
data = [['2022-10-14 11:47:38',1000,200],
['2022-10-14 11:47:39',1000,210],
['2022-10-14 11:47:40',1000,220],
['2022-10-14 11:47:41',1000,230],
['2022-10-14 11:47:42',1001,240],
['2022-10-14 11:47:43',1001,250],
['2022-10-14 11:47:44',1002,260],
['2022-10-14 11:47:45',1002,270]]
df = pd.DataFrame(data=data,columns=columns)
print(df)
- Calculate min and max values for each group:
df['min'] = df.groupby('groupid')['value'].transform('min')
df['max'] = df.groupby('groupid')['value'].transform('max')
- Create new columns in dataframe:
df['pmin'] = 0
df['pmax'] = 0
df['ppmin'] = 0
df['ppmax'] = 0
df['ppmin'] = df.groupby('groupid')['pmin'].transform('first').shift(1)
df['ppmax'] = df.groupby('groupid')['pmax'].transform('first').shift(1)
df['pmin'] = df.groupby('groupid')['min'].transform('first').shift(1)
df['pmax'] = df.groupby('groupid')['max'].transform('first').shift(1)
print(df)
The code in step 3 fails to return expected result as shown below:
columns2 = ['timestamp','groupid','value','min','max','pmin','pmax','ppmin','ppmax']
data2 = [['2022-10-14 11:47:38',1000,200,200,230,0,0,0,0],
['2022-10-14 11:47:39',1000,210,200,230,0,0,0,0],
['2022-10-14 11:47:40',1000,220,200,230,0,0,0,0],
['2022-10-14 11:47:41',1000,230,200,230,0,0,0,0],
['2022-10-14 11:47:42',1001,240,240,250,200,230,0,0],
['2022-10-14 11:47:43',1001,250,240,250,200,230,0,0],
['2022-10-14 11:47:44',1002,260,260,270,240,250,200,230],
['2022-10-14 11:47:45',1002,270,260,270,240,250,200,230]]
df = pd.DataFrame(data=data2,columns=columns2)
print(df)
CodePudding user response:
Build the dataframe of aggregation and shifts, and merge to get the final output:
grp = df.groupby('groupid')['value']
min_max = grp.agg(['min', 'max'])
pmin_max = min_max.add_prefix('p').shift(fill_value = 0)
ppmin_max = min_max.add_prefix('p').shift(2, fill_value = 0).add_prefix('p')
other = pd.concat([min_max, pmin_max, ppmin_max], axis = 1)
df.merge(other, on = 'groupid')
timestamp groupid value min max pmin pmax ppmin ppmax
0 2022-10-14 11:47:38 1000 200 200 230 0 0 0 0
1 2022-10-14 11:47:39 1000 210 200 230 0 0 0 0
2 2022-10-14 11:47:40 1000 220 200 230 0 0 0 0
3 2022-10-14 11:47:41 1000 230 200 230 0 0 0 0
4 2022-10-14 11:47:42 1001 240 240 250 200 230 0 0
5 2022-10-14 11:47:43 1001 250 240 250 200 230 0 0
6 2022-10-14 11:47:44 1002 260 260 270 240 250 200 230
7 2022-10-14 11:47:45 1002 270 260 270 240 250 200 230
CodePudding user response:
Maybe not the prettiest solution:
df["tmp"] = (df["groupid"] != df["groupid"].shift()).cumsum()
grps = df.groupby("groupid")["value"]
df["min"] = grps.transform("min")
df["max"] = grps.transform("max")
mapper = df.drop_duplicates("tmp").set_index("tmp")[["min", "max"]]
tmp1 = df["tmp"] - 1
tmp2 = df["tmp"] - 2
df["pmin"] = tmp1.map(mapper["min"])
df["pmax"] = tmp1.map(mapper["max"])
df["ppmin"] = tmp2.map(mapper["min"])
df["ppmax"] = tmp2.map(mapper["max"])
print(df.fillna(0).drop(columns="tmp"))
Prints:
timestamp groupid value min max pmin pmax ppmin ppmax
0 2022-10-14 11:47:38 1000 200 200 230 0.0 0.0 0.0 0.0
1 2022-10-14 11:47:39 1000 210 200 230 0.0 0.0 0.0 0.0
2 2022-10-14 11:47:40 1000 220 200 230 0.0 0.0 0.0 0.0
3 2022-10-14 11:47:41 1000 230 200 230 0.0 0.0 0.0 0.0
4 2022-10-14 11:47:42 1001 240 240 250 200.0 230.0 0.0 0.0
5 2022-10-14 11:47:43 1001 250 240 250 200.0 230.0 0.0 0.0
6 2022-10-14 11:47:44 1002 260 260 270 240.0 250.0 200.0 230.0
7 2022-10-14 11:47:45 1002 270 260 270 240.0 250.0 200.0 230.0