Winsorizing on column with NaN does not change the max value

Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).

I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column then the process works correctly.

However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like I must be setting the option incorrectly somehow.

Below is an example that reproduces both the correct winsorizing when there are no NaN values and the problem behavior when NaN values are present. Any help sorting this out would be appreciated.

# Imports
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize

# Initialise the data as a dict of lists.
data = {
    'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
             'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'],
    'Age': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0,
            11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0],
}
 
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)

# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan

# Winsorize Age in both DataFrames (10% on each tail, modifying the column in place)
winsorize(df['Age'], limits=[0.1, 0.1], inplace=True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace=True, nan_policy='omit')

# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())

print()

print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())
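
One possible reason only the upper tail is affected, offered here as an assumption rather than something confirmed in this thread: NumPy sorts NaN to the end of an array, so any rank-based trimming that sorts the data sees the NaN entries as the largest values. A minimal check of that sorting behavior, using a made-up array:

```python
import numpy as np

# Illustrative data (not from the question): four finite values and two NaN.
values = np.array([3.0, np.nan, 1.0, 20.0, np.nan, 2.0])

# np.sort places NaN after every finite value, so the "top" of the
# sorted array is occupied by NaN rather than by the real maximum.
sorted_values = np.sort(values)
print(sorted_values)

# Count the NaN entries and find the largest finite value.
n_nan = int(np.isnan(sorted_values).sum())
finite_max = np.nanmax(values)
print(n_nan, finite_max)
```

If the upper cut covers no more positions than there are NaN entries, the real maximum is never reached, which would match the symptom described above.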

CodePudding user response:

It looks like nan_policy is being ignored. But winsorization is just clipping at quantiles, so you can handle this with pandas: Series.quantile skips NaN values by default, and clip leaves them as NaN.

def winsorize_with_pandas(s, limits):
    """
    Winsorize a Series by clipping it to its lower and upper quantiles.

    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Fractions to cut on each side of the distribution, as floats
        between 0 and 1, relative to the number of non-NaN values
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
                  upper=s.quantile(1 - limits[1], interpolation='higher'))


winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0      3.0
1      3.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    18.0
19    18.0
Name: Age, dtype: float64

winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0      2.0
1      2.0
2      3.0
3      4.0
4      5.0
5      NaN
6      7.0
7      8.0
8      NaN
9     10.0
10    11.0
11    12.0
12    13.0
13    14.0
14    15.0
15    16.0
16    17.0
17    18.0
18    19.0
19    19.0
Name: Age, dtype: float64
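
To make the bounds in the NaN case explicit, here is a small self-contained sketch of the same clipping idea. It rebuilds a Series like df2['Age'] from scratch; the names ages, lower, upper and clipped are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Ages 1..20 with two missing entries, mirroring df2 in the question.
ages = pd.Series(np.arange(1.0, 21.0), name='Age')
ages.loc[[5, 8]] = np.nan

# Series.quantile skips NaN by default, so both bounds are computed
# from the 18 observed values only.
lower = ages.quantile(0.1, interpolation='lower')
upper = ages.quantile(0.9, interpolation='higher')
print(lower, upper)
# 2.0 19.0

# clip returns a new Series; NaN entries stay NaN and ages is unchanged.
clipped = ages.clip(lower=lower, upper=upper)
print(clipped.min(), clipped.max(), clipped.isna().sum())
# 2.0 19.0 2
```

Because clip returns a new Series, the result has to be assigned back (for example to df2['Age']) if the DataFrame itself should be updated.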

CodePudding user response:

You can consider filling the missing values with the column mean, then winsorizing and keeping the result only for the rows that were originally non-NaN:

df2 = pd.DataFrame(data)

# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan

# mask of non nan
_m = df2['Age'].notna()
df2.loc[_m, 'Age'] = winsorize(df2['Age'].fillna(df2['Age'].mean()), limits=[0.1, 0.1])[_m]
print(df2['Age'].max())
print(df2['Age'].min())
# 18.0
# 3.0

Alternatively, remove the NaN values before winsorizing. Recreate df2 as above first, since the previous snippet has already modified it:

df2.loc[_m, 'Age'] = winsorize(df2['Age'].loc[_m], limits=[0.1, 0.1])
print(df2['Age'].max())
print(df2['Age'].min())
# 19.0
# 2.0
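
The second option can also be wrapped in a small function that leaves its input untouched. This is only a sketch: winsorize_skipping_nan is a hypothetical helper name, not part of scipy or pandas, and it assumes a float Series.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize


def winsorize_skipping_nan(s, limits):
    """Hypothetical helper: winsorize the non-NaN values of s, keeping NaN in place."""
    out = s.copy()
    mask = out.notna()
    # Pass only the observed values so winsorize never sees a NaN.
    clipped = winsorize(out[mask].to_numpy(), limits=limits)
    # winsorize returns a masked array; convert it back to a plain array.
    out.loc[mask] = np.asarray(clipped)
    return out


# Same data as df2['Age'] in the question: 1..20 with two NaN entries.
ages = pd.Series(np.arange(1.0, 21.0), name='Age')
ages.loc[[5, 8]] = np.nan

result = winsorize_skipping_nan(ages, limits=[0.1, 0.1])
print(result.max(), result.min())
# 19.0 2.0
```

With 18 observed values and limits of 0.1, scipy replaces int(0.1 * 18) = 1 value on each side, which is why the extremes become 19.0 and 2.0 rather than 18.0 and 3.0.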