Pandas: winsorize feature outliers for each group

Time:03-03

I have a dataframe with 100 features and I want to winsorize the outliers within each 'group'. You can use the following code to generate the dataframe.

import numpy as np
import pandas as pd
from scipy.stats import mstats

data = np.random.randint(1,999,size=(500,101))

# build the column names f_0 ... f_100
cols = []
for i in range(101):
   cols += [f'f_{i}']

df = pd.DataFrame(data, columns=cols)
df['group'] = np.random.randint(1,4,size=500)  # 1-D array: one group label (1-3) per row
df = df.sort_values(by=['group'])

Now I want to winsorize (NOT delete!) the extreme values of each feature within each group.

If you are not sure what 'winsorize' means, here is an example:

Before winsorizing:

1, 2, 3, 4, 5 ... 97, 98, 99, 100

After winsorizing the smallest and largest 1%:

2, 2, 3, 4, 5 ... 97, 98, 99, 99
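
For what it's worth, the example above can be reproduced with a small sketch like the one below. It assumes scipy's mstats.winsorize with its default settings; the names values and result are only illustrative.

```python
import numpy as np
from scipy.stats import mstats

# The integers 1..100, as in the example above.
values = np.arange(1, 101)

# Replace the lowest 1% and the highest 1% of the values with their
# nearest remaining neighbours (here: 1 -> 2 and 100 -> 99).
# np.asarray drops the masked-array wrapper that winsorize returns.
result = np.asarray(mstats.winsorize(values, limits=[0.01, 0.01]))

print(result[:5])   # smallest values after winsorizing
print(result[-4:])  # largest values after winsorizing
```

Note that the array keeps its length of 100: the extreme values are overwritten, not dropped.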

I know how to winsorize the extreme 1% of values of each feature across the entire dataframe, using the following code.

for col in df.columns:
    # mstats is imported directly above, so call it without the stats. prefix
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])

However, I want to winsorize each feature separately within each group.

Can anyone please advise? Thank you!

CodePudding user response:

There must be a more elegant way than this, but it seems to work for me and it's just a tiny addition to your solution:

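for col in df.columns.drop('group'):  # skip the grouping column itself
    for group in df.group.unique():
        mask = df.group == group
        # assign through .loc rather than chained indexing, which may write to a copy
        df.loc[mask, col] = np.asarray(mstats.winsorize(df.loc[mask, col], limits=[0.01, 0.01]))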

As you can see, I also iterate through the groups in addition to the columns, and solve the problem with simple filtering of each column.
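
A possibly tidier alternative, offered only as a sketch: let pandas handle the per-group split with groupby(...).transform, so that only the loop over columns remains. The helper winsorize_series, the list feature_cols, and the smaller 5-feature frame are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Small stand-in for the dataframe in the question (5 features instead of 101).
rng = np.random.default_rng(0)
feature_cols = [f"f_{i}" for i in range(5)]
df = pd.DataFrame(rng.integers(1, 999, size=(500, 5)), columns=feature_cols)
df["group"] = rng.integers(1, 4, size=500)


def winsorize_series(s, limits=(0.01, 0.01)):
    """Winsorize one Series and return a Series with the same index."""
    clipped = mstats.winsorize(s.to_numpy(), limits=limits)
    # np.asarray drops the masked-array wrapper returned by winsorize.
    return pd.Series(np.asarray(clipped), index=s.index)


winsorized = df.copy()
for col in feature_cols:
    # transform calls winsorize_series once per group and puts the results
    # back in the original row order.
    winsorized[col] = df.groupby("group")[col].transform(winsorize_series)
```

The result lands in a copy (winsorized), so the original df stays available for comparison. Whether this is faster than the double loop depends on the data, so it is worth timing on the real frame.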
