np.argsort and pd.nsmallest give different results

Time:11-13

With example data and code below:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, columns=['A_values', 'B_values', 'C_values', 'target'])

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0)*100

preds = df.columns[df.columns.str.endswith('_values')]
k = 2
print(df)

Out:

            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
             ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

[90 rows x 4 columns]

I want to calculate the MAPE and find the 2 lowest error values for each year/month group, using two different methods:

Method 1:

def grpProc(grp):
    err = mape(grp[preds], grp[['target']])
    print(err)
    sort_args = np.argsort(err, axis=1) < k
    cols = preds[sort_args]
    print(cols)
    print('-'*50)

df.groupby(pd.Grouper(freq='M')).apply(grpProc)

Out:

A_values     54.685258
B_values    212.458242
C_values    161.332752
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values     77.504315
B_values    128.986127
C_values    118.977186
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values    132.535352
B_values    150.886936
C_values     94.279492
dtype: float64
Index(['B_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values    150.554314
B_values    114.113724
C_values     92.487276
dtype: float64
Index(['B_values', 'C_values'], dtype='object')
--------------------------------------------------

Method 2:

def grpProc(grp):
    err = mape(grp[preds], grp[['target']])
    print(err)
    cols = err.nsmallest(k).index
    print(cols)
    print('-'*50)

df.groupby(pd.Grouper(freq='M')).apply(grpProc)

Out:

A_values     54.685258
B_values    212.458242
C_values    161.332752
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values     77.504315
B_values    128.986127
C_values    118.977186
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values    132.535352
B_values    150.886936
C_values     94.279492
dtype: float64
Index(['C_values', 'A_values'], dtype='object')
--------------------------------------------------
A_values    150.554314
B_values    114.113724
C_values     92.487276
dtype: float64
Index(['C_values', 'B_values'], dtype='object')
--------------------------------------------------

As you can see, method 1 picks the wrong 2 lowest values for the third group. The correct result is ['C_values', 'A_values'], but it returns:

A_values    132.535352
B_values    150.886936
C_values     94.279492
dtype: float64
Index(['B_values', 'C_values'], dtype='object')

How can I get the correct result using np.argsort instead of pd.nsmallest? Thanks.

EDIT:

def grpProc(grp):
    err = mape(grp[preds], grp[['target']])
    print(err)
    # sort_args = np.argsort(err, axis=1) < k # incorrect result
    # sort_args = np.argsort(err, axis=1)[:k] # correct result and order of values
    # sort_args  = np.argsort(err).head(k) # correct result and order of values
    sort_args = np.argsort(np.argsort(err, axis=1)) < k # correct result but incorrect order of values
    cols = preds[sort_args]
    print(cols)
    print('-'*50)

df.groupby(pd.Grouper(freq='M')).apply(grpProc)

Out:

A_values     54.685258
B_values    212.458242
C_values    161.332752
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values     77.504315
B_values    128.986127
C_values    118.977186
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values    132.535352
B_values    150.886936
C_values     94.279492
dtype: float64
Index(['A_values', 'C_values'], dtype='object')
--------------------------------------------------
A_values    150.554314
B_values    114.113724
C_values     92.487276
dtype: float64
Index(['B_values', 'C_values'], dtype='object')
--------------------------------------------------

CodePudding user response:

np.argsort returns the positions that would sort the values, not the rank of each value, so use those positions to reindex:

sort_args  = err.iloc[np.argsort(err)].head(2)

pandas has its own argsort as well, which gives the same result as NumPy's:

err.iloc[err.argsort()].head(2)
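
To make the "positions" point concrete, here is a minimal sketch using the error values printed for the third group in the question. The names `order` and `top_k` are just illustrative, and the Series is rebuilt by hand rather than computed with mape:

```python
import numpy as np
import pandas as pd

# Error values copied from the third group's printed output in the question.
err = pd.Series(
    [132.535352, 150.886936, 94.279492],
    index=['A_values', 'B_values', 'C_values'],
)
k = 2

# argsort returns the positions that would sort err in ascending order.
order = np.argsort(err.to_numpy())

# Taking the first k positions gives the k smallest labels, smallest first.
top_k = err.index[order[:k]]

print(order.tolist())                 # [2, 0, 1]
print(list(top_k))                    # ['C_values', 'A_values']
print(list(err.nsmallest(k).index))   # ['C_values', 'A_values']
```

Slicing the positions this way should agree with nsmallest, including the order of the labels, at least when there are no tied values.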

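Update: if you need a boolean mask, apply argsort twice to turn the positions into ranks (the selected columns then keep their original column order):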

sort_args = np.argsort(np.argsort(err, axis=1)) < k
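
As a rough illustration of why a single argsort fails with `< k` while the double argsort works, here is a small sketch on the third group's error values from the question. The variable names (`positions`, `ranks`, `wrong_mask`, `right_mask`) are made up for the example, and it assumes there are no tied values:

```python
import numpy as np
import pandas as pd

# Error values copied from the third group's printed output in the question.
err = pd.Series(
    [132.535352, 150.886936, 94.279492],
    index=['A_values', 'B_values', 'C_values'],
)
k = 2
values = err.to_numpy()

# Single argsort: element i is the position of the i-th smallest value.
positions = np.argsort(values)    # [2, 0, 1]

# Double argsort: element i is the rank of values[i] (0 = smallest).
ranks = np.argsort(positions)     # [1, 2, 0]

# Comparing positions with k is not meaningful as a per-column mask.
wrong_mask = positions < k        # [False, True, True]
# Comparing ranks with k marks exactly the k smallest columns.
right_mask = ranks < k            # [True, False, True]

print(list(err.index[wrong_mask]))   # ['B_values', 'C_values']
print(list(err.index[right_mask]))   # ['A_values', 'C_values']
```

The mask version picks the right columns but lists them in column order. If you need them ordered from smallest error to largest, slice the positions with `[:k]` instead.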