I have a grouped dataframe consisting of a multilevel index of items (title ord_base7), a snapshot date of when sales forecasts were made, and the different models that made those forecasts along with each model's error (MAPE).
I'd like to get simplify the data frame to the model with the smallest MAPE for each ord_base7/snapshot date.
Below is a sample of the data for one item.
import pandas as pd
df_dict = {'ord_base7': ['100179K', '100179K', '100179K', '100179K', '100179K', '100179K',
'100179K', '100179K', '100179K', '100179K', '100179K', '100179K', '100179K', '100179K',
'100179K', '100179K', '100179K', '100179K','100179K', '100179K', '100179K', '100179K',
'100179K', '100179K', '100179K', '100179K', '100179K'],
'snapshot': ['2020-12-31 00:00:00', '2020-12-31 00:00:00',
'2020-12-31 00:00:00', '2020-12-31 00:00:00',
'2020-12-31 00:00:00', '2020-12-31 00:00:00',
'2020-12-31 00:00:00', '2021-01-31 00:00:00',
'2021-01-31 00:00:00', '2021-01-31 00:00:00',
'2021-01-31 00:00:00', '2021-01-31 00:00:00',
'2021-01-31 00:00:00', '2021-01-31 00:00:00',
'2021-02-28 00:00:00', '2021-02-28 00:00:00',
'2021-02-28 00:00:00', '2021-02-28 00:00:00',
'2021-02-28 00:00:00', '2021-02-28 00:00:00',
'2021-02-28 00:00:00', '2021-03-31 00:00:00',
'2021-03-31 00:00:00', '2021-03-31 00:00:00',
'2021-03-31 00:00:00', '2021-03-31 00:00:00',
'2021-03-31 00:00:00'],
'model': ['AutoArima', 'SARIMA', 'STLF', 'TESA', 'TESM', 'lstm_neural_net', 'prophet',
'AutoArima', 'SARIMA', 'STLF', 'TESA','TESM', 'lstm_neural_net', 'prophet',
'AutoArima', 'SARIMA', 'STLF', 'TESA', 'TESM', 'lstm_neural_net', 'prophet',
'AutoArima', 'SARIMA', 'STLF', 'TESA', 'TESM', 'lstm_neural_net'],
'MAPE': [71.50999999999999, 4439.13, 1.34, 96.31, 100.13000000000001, 70.17999999999999,
53.25, 18.32, 20227.11, 12.15, 92.25999999999999, 93.14, 12.2, 11.3, 8.99, 20664.37,
18.68, 15.8, 20.39, 1.69, 25.679999999999996, 3.0700000000000003, 8.68, 24.6, 5.43,
2.93, 1.87]}
df = pd.DataFrame.from_dict(df_dict)
df.set_index(['ord_base7', 'snapshot', 'model'],inplace = True)
I tried to just pull the min(MAPE) out but that's over the whole dataset. I need to pull it for each snapshot in time for each item so I know what model performed the best at that time.
How can I do this?
CodePudding user response:
Groupby and use the .min()
function along it (you also have to stop setting the index to snapshot
as you can't use it in a .groupby()
selection):
# ... code before
df.set_index(['ord_base7', 'model'],inplace = True)
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])
gives:
snapshot MAPE
ord_base7 model
100179K STLF 2020-12-31 00:00:00 1.34
prophet 2021-01-31 00:00:00 11.30
lstm_neural_net 2021-02-28 00:00:00 1.69
lstm_neural_net 2021-03-31 00:00:00 1.87
Why this method is better
Using the datetime module to time this and the other answer, I found out that my method is quicker on average:
My method
from datetime import datetime
start_time = datetime.now()
#... code after
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
Gave:
Duration: 0:00:00.034470
or 0.034 seconds.
Other method
from datetime import datetime
start_time = datetime.now()
#... code after
print(df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
Gave:
Duration: 0:00:00.092165
or 0.092 seconds.
So
My method < Other method
0.034 < 0.092
CodePudding user response:
You can use idxmin
:
>>> df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()]
MAPE
ord_base7 snapshot model
100179K 2020-12-31 00:00:00 STLF 1.34
2021-01-31 00:00:00 prophet 11.30
2021-02-28 00:00:00 lstm_neural_net 1.69
2021-03-31 00:00:00 lstm_neural_net 1.87
or df.loc[df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].idxmin()]
depends on what "each item" means.
Update
You can also use df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].nsmallest(1)