I have the following data:
{'Date': {0: Timestamp('2016-05-10 00:00:00'), 1: Timestamp('2016-05-11 00:00:00'), 2: Timestamp('2016-05-12 00:00:00'), 3: Timestamp('2016-05-13 00:00:00'), 4: Timestamp('2016-05-14 00:00:00'), 5: Timestamp('2016-05-15 00:00:00'), 6: Timestamp('2016-05-16 00:00:00'), 7: Timestamp('2016-05-17 00:00:00'), 8: Timestamp('2016-05-18 00:00:00'), 9: Timestamp('2016-05-19 00:00:00'), 10: Timestamp('2016-05-20 00:00:00'), 11: Timestamp('2016-05-21 00:00:00')}, 'variable': {0: 0.8491095654077778, 1: 0.40413794873666165, 2: 0.21376796260531072, 3: 0.07286348591835323, 4: 0.5721419081310215, 5: 0.35390011440489355, 6: 0.010530093815129149, 7: 0.013325613826142724, 8: 0.9207323422142153, 9: inf, 10: 0.5726367316365334, 11: 0.8351572257132013}}
I would like to winsorize it, so I tried:
df_inf['variable'] = winsorize(df_inf['variable'], limits=[0.05, 0.05], inclusive=(True, True), inplace=False, axis=None, nan_policy='omit')
However, the inf value is not replaced. Any ideas on how winsorizing can be applied to the inf value as well?
CodePudding user response:
You can't really handle infinite values in a quantile calculation.
I see two options.
You want to consider the infinite as NaN:
df_inf['variable'] = winsorize(df_inf['variable'].where(np.isfinite(df_inf['variable'])),
limits=[0.05, 0.05], inclusive=(True, True),
inplace=False, axis=None, nan_policy='omit')
Output:
Date variable
0 2016-05-10 0.849110
1 2016-05-11 0.404138
2 2016-05-12 0.213768
3 2016-05-13 0.072863
4 2016-05-14 0.572142
5 2016-05-15 0.353900
6 2016-05-16 0.010530
7 2016-05-17 0.013326
8 2016-05-18 0.920732
9 2016-05-19 NaN
10 2016-05-20 0.572637
11 2016-05-21 0.835157
You want to keep the Inf but assign an arbitrary high value:
df_inf['variable'] = winsorize(df_inf['variable'].clip(upper=1000),
limits=[0.05, 0.05], inclusive=(True, True),
inplace=False, axis=None, nan_policy='omit')
Output:
Date variable
0 2016-05-10 0.849110
1 2016-05-11 0.404138
2 2016-05-12 0.213768
3 2016-05-13 0.072863
4 2016-05-14 0.572142
5 2016-05-15 0.353900
6 2016-05-16 0.010530
7 2016-05-17 0.013326
8 2016-05-18 0.920732
9 2016-05-19 1000.000000
10 2016-05-20 0.572637
11 2016-05-21 0.835157
CodePudding user response:
You can use the built in function clip
in pandas. I give you the general solution for even if you need to winsorize with groupby
. This works well even with np.inf
Sample data:
df = pd.DataFrame({'name': ['a']*100 ['b']*100,
'val':np.random.randint(1,100,200)})
df.iloc[[0,1],1]=np.inf
df.iloc[[2,3],1]=-np.inf
Winsorize
# wpct
wpct = 0.01 # threshold to winsorize at 1% and 99%. Change to any value you like
# winsorize
df['valws'] = df.groupby('name')['val'].apply(lambda g: g.clip(lower=g.quantile(wpct), upper=g.quantile(1-wpct)))