Home > OS >  Winsorize data with inf values
Winsorize data with inf values

Time:12-20

I have the following data:

{'Date': {0: Timestamp('2016-05-10 00:00:00'), 1: Timestamp('2016-05-11 00:00:00'), 2: Timestamp('2016-05-12 00:00:00'), 3: Timestamp('2016-05-13 00:00:00'), 4: Timestamp('2016-05-14 00:00:00'), 5: Timestamp('2016-05-15 00:00:00'), 6: Timestamp('2016-05-16 00:00:00'), 7: Timestamp('2016-05-17 00:00:00'), 8: Timestamp('2016-05-18 00:00:00'), 9: Timestamp('2016-05-19 00:00:00'), 10: Timestamp('2016-05-20 00:00:00'), 11: Timestamp('2016-05-21 00:00:00')}, 'variable': {0: 0.8491095654077778, 1: 0.40413794873666165, 2: 0.21376796260531072, 3: 0.07286348591835323, 4: 0.5721419081310215, 5: 0.35390011440489355, 6: 0.010530093815129149, 7: 0.013325613826142724, 8: 0.9207323422142153, 9: inf, 10: 0.5726367316365334, 11: 0.8351572257132013}}

I would like to winsorize it, so I tried:

df_inf['variable'] = winsorize(df_inf['variable'], limits=[0.05, 0.05], inclusive=(True, True), inplace=False, axis=None, nan_policy='omit')

However, the inf value is not replaced. Any ideas on how winsorizing can be applied to the inf value as well?

CodePudding user response:

You can't really handle infinite values in a quantile calculation.

I see two options.

You want to consider the infinite as NaN:

df_inf['variable'] = winsorize(df_inf['variable'].where(np.isfinite(df_inf['variable'])),
                               limits=[0.05, 0.05], inclusive=(True, True),
                               inplace=False, axis=None, nan_policy='omit')

Output:

         Date  variable
0  2016-05-10  0.849110
1  2016-05-11  0.404138
2  2016-05-12  0.213768
3  2016-05-13  0.072863
4  2016-05-14  0.572142
5  2016-05-15  0.353900
6  2016-05-16  0.010530
7  2016-05-17  0.013326
8  2016-05-18  0.920732
9  2016-05-19       NaN
10 2016-05-20  0.572637
11 2016-05-21  0.835157

You want to keep the Inf but assign an arbitrary high value:

df_inf['variable'] = winsorize(df_inf['variable'].clip(upper=1000),
                               limits=[0.05, 0.05], inclusive=(True, True),
                               inplace=False, axis=None, nan_policy='omit')

Output:

         Date     variable
0  2016-05-10     0.849110
1  2016-05-11     0.404138
2  2016-05-12     0.213768
3  2016-05-13     0.072863
4  2016-05-14     0.572142
5  2016-05-15     0.353900
6  2016-05-16     0.010530
7  2016-05-17     0.013326
8  2016-05-18     0.920732
9  2016-05-19  1000.000000
10 2016-05-20     0.572637
11 2016-05-21     0.835157

CodePudding user response:

You can use the built in function clip in pandas. I give you the general solution for even if you need to winsorize with groupby. This works well even with np.inf

Sample data:

df = pd.DataFrame({'name': ['a']*100 ['b']*100,
                   'val':np.random.randint(1,100,200)})
df.iloc[[0,1],1]=np.inf
df.iloc[[2,3],1]=-np.inf

Winsorize

# wpct
wpct = 0.01 # threshold to winsorize at 1% and 99%. Change to any value you like

# winsorize
df['valws'] = df.groupby('name')['val'].apply(lambda g: g.clip(lower=g.quantile(wpct), upper=g.quantile(1-wpct)))
  • Related