This is my current df. I would like to transform the dataframe in 3 steps. I need to remove duplicate timestamps, keeping either the largest or the smallest value depending on the "Side" column. Please help :)
I have tried df = df[~df.index.duplicated(keep='first')]
however, this has no option to keep the max or min values.
The index is of datetime type, Price is float and Side is integer; the data frame has 8000 rows.
Price Side
2021-12-13 00:00:03.285 51700 4
2021-12-13 00:00:03.315 51675 3
2021-12-13 00:00:03.333 50123 4
2021-12-13 00:00:03.333 50200 3
2021-12-13 00:00:03.333 50225 3
2021-12-13 00:00:03.333 50250 3
2021-12-13 00:00:03.421 50123 4
2021-12-13 00:00:03.421 50117 4
2021-12-13 00:00:03.421 50110 4
2021-12-13 00:00:03.671 50100 3
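For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and `rows`, `Time` and `first_only` are names made up for the example. It also shows what the `duplicated(keep='first')` attempt keeps.

```python
import pandas as pd

# Timestamps, prices and sides copied from the sample table.
rows = [
    ("2021-12-13 00:00:03.285", 51700.0, 4),
    ("2021-12-13 00:00:03.315", 51675.0, 3),
    ("2021-12-13 00:00:03.333", 50123.0, 4),
    ("2021-12-13 00:00:03.333", 50200.0, 3),
    ("2021-12-13 00:00:03.333", 50225.0, 3),
    ("2021-12-13 00:00:03.333", 50250.0, 3),
    ("2021-12-13 00:00:03.421", 50123.0, 4),
    ("2021-12-13 00:00:03.421", 50117.0, 4),
    ("2021-12-13 00:00:03.421", 50110.0, 4),
    ("2021-12-13 00:00:03.671", 50100.0, 3),
]
df = pd.DataFrame(rows, columns=["Time", "Price", "Side"])
df["Time"] = pd.to_datetime(df["Time"])
# Use the timestamps as an unnamed datetime index, as in the question.
df = df.set_index("Time").rename_axis(None)

# The attempt from the question: keeps only the first row per timestamp,
# regardless of Side or Price.
first_only = df[~df.index.duplicated(keep="first")]
print(first_only)
```

At 00:00:03.333 this keeps only the Side 4 row with 50123 and drops every Side 3 row, which is why it is not enough here.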
- If the time is duplicated, keep the highest value when the side is "3" and the lowest value when the side is "4". Duplicates are resolved per side, so one timestamp can keep a row for each side.
Desired Output:
Price Side
2021-12-13 00:00:03.285 51700 4
2021-12-13 00:00:03.315 51675 3
2021-12-13 00:00:03.333 50123 4
2021-12-13 00:00:03.333 50250 3
2021-12-13 00:00:03.421 50110 4
2021-12-13 00:00:03.671 50100 3
- Create new columns "3" and "4" with the corresponding prices, and 0 in the column of the other side
Desired Output:
Price 3 4
2021-12-13 00:00:03.285 51700 0 51700
2021-12-13 00:00:03.315 51675 51675 0
2021-12-13 00:00:03.333 50123 0 50123
2021-12-13 00:00:03.333 50250 50250 0
2021-12-13 00:00:03.421 50110 0 50110
2021-12-13 00:00:03.671 50100 50100 0
- Fill in the blanks with previous values from the same column
Desired Output:
Price 3 4
2021-12-13 00:00:03.285 51700 0 51700
2021-12-13 00:00:03.315 51675 51675 51700
2021-12-13 00:00:03.333 50123 51675 50123
2021-12-13 00:00:03.333 50250 50250 50123
2021-12-13 00:00:03.421 50110 50250 50110
2021-12-13 00:00:03.671 50100 50100 50110
CodePudding user response:
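import pandas as pd

# Step 1: one row per (timestamp, Side); max price for Side 3, min for Side 4.
# Note: the lambda reads x['Side'], a grouping column; pandas 2.2+ warns that
# passing grouping columns to apply is deprecated.
new_df = (df
    .groupby([pd.Grouper(level=0), 'Side'])
    .apply(lambda x: x['Price'].max() if x['Side'].mode()[0] == 3 else x['Price'].min())
    .reset_index()
)
# Steps 2 and 3: pivot the prices into one column per Side, forward-fill,
# and use 0 where there is no earlier value.
new_df = (
    pd.concat([
        new_df,
        (new_df
            .pivot(columns='Side', values=0)
            .ffill()
            .fillna(0)
        )
    ], axis=1)
    .drop('Side', axis=1)
    .rename({0: 'Price'}, axis=1)
)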
Output:
>>> new_df
index Price 3 4
0 2021-12-13 00:00:03.285 51700 0.0 51700.0
1 2021-12-13 00:00:03.315 51675 51675.0 51700.0
2 2021-12-13 00:00:03.333 50250 50250.0 51700.0
3 2021-12-13 00:00:03.333 50123 50250.0 50123.0
4 2021-12-13 00:00:03.421 50110 50250.0 50110.0
5 2021-12-13 00:00:03.671 50100 50100.0 50110.0
Compact version:
new_df = df.groupby([pd.Grouper(level=0), 'Side']).apply(lambda x: x['Price'].max() if x['Side'].mode()[0] == 3 else x['Price'].min()).reset_index()
new_df = pd.concat([new_df, new_df.pivot(columns='Side', values=0).ffill().fillna(0)], axis=1).drop('Side', axis=1).rename({0:'Price'}, axis=1)
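If you would rather avoid `apply`, the same three steps can be sketched with `agg` and `where`. This is only a sketch, assuming Side contains just 3 and 4; the column name `Time` and the variable names `agg` and `result` are made up for the example, and the sample frame is copied from the question. Passing `sort=False` keeps the groups in order of first appearance, so the rows should come out in the same order as the desired output.

```python
import pandas as pd

# Sample data in the same shape as the question (float Price, int Side).
idx = pd.to_datetime([
    "2021-12-13 00:00:03.285", "2021-12-13 00:00:03.315",
    "2021-12-13 00:00:03.333", "2021-12-13 00:00:03.333",
    "2021-12-13 00:00:03.333", "2021-12-13 00:00:03.333",
    "2021-12-13 00:00:03.421", "2021-12-13 00:00:03.421",
    "2021-12-13 00:00:03.421", "2021-12-13 00:00:03.671",
])
df = pd.DataFrame(
    {
        "Price": [51700.0, 51675.0, 50123.0, 50200.0, 50225.0,
                  50250.0, 50123.0, 50117.0, 50110.0, 50100.0],
        "Side": [4, 3, 4, 3, 3, 3, 4, 4, 4, 3],
    },
    index=idx,
)

# Step 1: min and max per (timestamp, Side), then pick max for Side 3
# and min for Side 4. "Time" is an assumed name for the unnamed index.
agg = (df.rename_axis("Time")
         .reset_index()
         .groupby(["Time", "Side"], sort=False)["Price"]
         .agg(["min", "max"])
         .reset_index())
agg["Price"] = agg["max"].where(agg["Side"] == 3, agg["min"])

# Steps 2 and 3: one column per side holding that side's price,
# forward-filled from earlier rows, with 0 where nothing came before.
for side in (3, 4):
    agg[side] = agg["Price"].where(agg["Side"] == side).ffill().fillna(0)

result = agg.set_index("Time")[["Price", 3, 4]]
print(result)
```

Computing both `min` and `max` for every group does a little redundant work, but it keeps everything vectorized, which may matter more on the full 8000 rows than on this sample.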
CodePudding user response:
This is one option, which is a bit long:
(df.assign(temp = df.Side.map({4:'low', 3:'high'}))
   .groupby([pd.Grouper(level=0), 'Side', 'temp'], sort = False)
   .Price
   .agg(['min', 'max'])
   .unstack('Side')
   .loc(axis=1)[[('max', 3), ('min', 4)]]
   .droplevel(level = 0,axis = 1)
   .droplevel(level = 'temp')
   .assign(Price=lambda df: df[3].where(df[3].notna(), df[4]))
   .ffill()
   .fillna(0)
   .astype(int)
   .rename_axis(columns = None)
)
3 4 Price
2021-12-13 00:00:03.285 0 51700 51700
2021-12-13 00:00:03.315 51675 51700 51675
2021-12-13 00:00:03.333 51675 50123 50123
2021-12-13 00:00:03.333 50250 50123 50250
2021-12-13 00:00:03.421 50250 50110 50110
2021-12-13 00:00:03.671 50100 50110 50100
This assumes that the only values in Side are 3 and 4.
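If you are not sure that holds for the full data, a small guard could be run first. This is a minimal sketch; `check_sides` and its `allowed` parameter are hypothetical names, not part of pandas.

```python
import pandas as pd

def check_sides(df, allowed=(3, 4)):
    """Raise ValueError if the Side column holds anything outside `allowed`."""
    # Collect the distinct Side values that are not in the allowed set.
    bad = df.loc[~df["Side"].isin(list(allowed)), "Side"].unique().tolist()
    if bad:
        raise ValueError(f"unexpected Side values: {sorted(bad)}")
    return df

# Tiny made-up sample with only valid sides; this passes silently.
sample = pd.DataFrame({"Price": [51700.0, 51675.0], "Side": [4, 3]})
check_sides(sample)
```

Because it returns the frame unchanged, it can also be placed at the start of a chain with `df.pipe(check_sides)`.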