How do I take a grouped pandas dataframe and pull out minimum value for each group?-CodePudding

I have a grouped dataframe consisting of a multilevel index of items (title ord_base7), a snapshot date of when sales forecasts were made, and the different models that made those forecasts along with each model's error (MAPE).

I'd like to get simplify the data frame to the model with the smallest MAPE for each ord_base7/snapshot date.

Below is a sample of the data for one item.

import pandas as pd

df_dict = {'ord_base7': ['100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K','100179K',  '100179K',  '100179K',  '100179K',
  '100179K',  '100179K',  '100179K',  '100179K',  '100179K'],
 'snapshot': ['2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2020-12-31 00:00:00',
  '2020-12-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-01-31 00:00:00',  '2021-01-31 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-02-28 00:00:00',
  '2021-02-28 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00',  '2021-03-31 00:00:00',
  '2021-03-31 00:00:00'],
 'model': ['AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA','TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net',  'prophet',
  'AutoArima',  'SARIMA',  'STLF',  'TESA',  'TESM',  'lstm_neural_net'],
 'MAPE': [71.50999999999999,  4439.13,  1.34,  96.31,  100.13000000000001,  70.17999999999999,
  53.25,  18.32,  20227.11,  12.15,  92.25999999999999,  93.14,  12.2,  11.3,  8.99,  20664.37,
  18.68,  15.8,  20.39,  1.69,  25.679999999999996,  3.0700000000000003,  8.68,  24.6,  5.43,
  2.93,  1.87]}

df = pd.DataFrame.from_dict(df_dict)
df.set_index(['ord_base7', 'snapshot', 'model'],inplace = True)

I tried to just pull the min(MAPE) out but that's over the whole dataset. I need to pull it for each snapshot in time for each item so I know what model performed the best at that time.

How can I do this?

CodePudding user response：

Groupby and use the .min() function along it (you also have to stop setting the index to snapshot as you can't use it in a .groupby() selection):

# ... code before
df.set_index(['ord_base7', 'model'],inplace = True)
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])

gives:

                                      snapshot   MAPE
ord_base7 model                                      
100179K   STLF             2020-12-31 00:00:00   1.34
          prophet          2021-01-31 00:00:00  11.30
          lstm_neural_net  2021-02-28 00:00:00   1.69
          lstm_neural_net  2021-03-31 00:00:00   1.87

Why this method is better

Using the datetime module to time this and the other answer, I found out that my method is quicker on average:

My method

from datetime import datetime

start_time = datetime.now()

#... code after
print(df.loc[df['MAPE'].isin(df.groupby('snapshot')['MAPE'].min().tolist())])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Gave:

Duration: 0:00:00.034470

or 0.034 seconds.

Other method

from datetime import datetime

start_time = datetime.now()

#... code after
print(df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()])

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Gave:

Duration: 0:00:00.092165

or 0.092 seconds.

So

My method < Other method

  0.034   <    0.092

CodePudding user response：

You can use idxmin:

>>> df.loc[df.groupby(level='snapshot')['MAPE'].idxmin()]
                                                MAPE
ord_base7 snapshot            model                 
100179K   2020-12-31 00:00:00 STLF              1.34
          2021-01-31 00:00:00 prophet          11.30
          2021-02-28 00:00:00 lstm_neural_net   1.69
          2021-03-31 00:00:00 lstm_neural_net   1.87

or df.loc[df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].idxmin()] depends on what "each item" means.

Update

You can also use df.groupby(level=['ord_base7', 'snapshot'])['MAPE'].nsmallest(1)