I have a dataframe with 100 features and I want to winsorize the outliers within each 'group'. You can use the following code to generate the dataframe.
import numpy as np
import pandas as pd
from scipy.stats import mstats
data = np.random.randint(1,999,size=(500,101))
# Build the feature column names f_0 ... f_100
cols = []
for i in range(101):
    cols.append(f'f_{i}')
df = pd.DataFrame(data, columns=cols)
# Assign each row to a random group (1, 2 or 3)
df['group'] = np.random.randint(1, 4, size=500)
df = df.sort_values(by=['group'])
Now I want to winsorize (NOT delete!) the extreme values of each feature within each group.
In case 'winsorize' is unfamiliar, here is an example:
Before winsorize:
1, 2, 3, 4, 5 ... 97, 98, 99, 100
After winsorize the smallest and largest 1%:
2, 2, 3, 4, 5 ... 97, 98, 99, 99
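The example above can be reproduced with a minimal sketch like the one below. It assumes SciPy's `mstats.winsorize` with its default settings; the variable names `values` and `clipped` are only illustrative.

```python
import numpy as np
from scipy.stats import mstats

# The integers 1, 2, ..., 100 from the example above
values = np.arange(1, 101)

# Replace the lowest 1% and highest 1% with the nearest remaining value
clipped = np.asarray(mstats.winsorize(values, limits=[0.01, 0.01]))

print(clipped[:5], clipped[-4:])
```

No value is dropped: the array keeps its length, and only the most extreme entries are pulled in to the nearest value inside the limits.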
I know how to winsorize the extreme 1% of values of each feature across the entire dataframe, using the following code.
for col in df.columns.drop('group'):  # skip the grouping column itself
    df[col] = mstats.winsorize(df[col], limits=[0.01, 0.01])
However, I want to winsorize each feature separately within each group.
Can anyone please advise? Thank you!
CodePudding user response:
There must be a more elegant way than this, but it seems to work for me and it's just a tiny addition to your solution:
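for col in df.columns.drop('group'):  # skip the grouping column itself
    for group in df['group'].unique():
        mask = df['group'] == group
        # .loc avoids chained assignment, which may not write back to df
        df.loc[mask, col] = mstats.winsorize(df.loc[mask, col], limits=[0.01, 0.01])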
As you can see, I also iterate through the groups in addition to the columns, and solve the problem with simple filtering of each column.
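A possibly tidier variant is to let pandas do the group filtering through `groupby(...).transform`. The sketch below is only one way to do it: `winsorize_series` is a hypothetical helper, not part of pandas or SciPy, and the small five-feature frame stands in for the real one.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

# Small stand-in for the dataframe from the question (5 features instead of 101)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 999, size=(500, 5)),
                  columns=[f'f_{i}' for i in range(5)])
df['group'] = rng.integers(1, 4, size=500)

def winsorize_series(s, limits=(0.01, 0.01)):
    """Hypothetical helper: winsorize one group's values, keeping its index."""
    clipped = mstats.winsorize(s.to_numpy(), limits=limits)
    return pd.Series(np.asarray(clipped), index=s.index)

feature_cols = list(df.columns.drop('group'))
grouped = df.groupby('group')

result = df.copy()
for col in feature_cols:
    # transform applies the helper to each group and realigns to the original rows
    result[col] = grouped[col].transform(winsorize_series)
```

This still loops over the columns, but the inner loop over groups and the boolean filtering are handled by `transform`, and the original `df` is left unchanged.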