I'm having a bit of trouble finding outliers in a df based on groups and dates.
For exampe I have a df like and I would like to find and replace the outlier values (10 for the group A on date 2022-06-27 and 20 for the group B on 2022-06-27) with the median of the respective group (3 for the first outliers and 4 for the second).
However I'm having some trouble filtering the data and isolating the outliers and replacing them.
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
'2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['vale'] = r
print (df)
group date val
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 10
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 20
Thanks for the help!
CodePudding user response:
First you can identify outliers. This code identifies any values that are greater than one standard deviation away from the mean.
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
Then you can determine the median of each group:
medians = df.groupby('group')['value'].median()
Finally, locate the outliers and replace with the medians:
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()
All together it looks like:
import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
'2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values
Output:
group date value
0 A 2022-06-28 1
1 A 2022-06-28 2
2 A 2022-06-28 1
3 A 2022-06-27 2
4 A 2022-06-27 3
5 A 2022-06-27 2
6 B 2022-06-28 2
7 B 2022-06-28 3
8 B 2022-06-28 2
9 B 2022-06-27 3
10 B 2022-06-27 4
11 B 2022-06-27 3
CodePudding user response:
You can use a combination of .groupby/transform
to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.
medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)
print(df)
group date value new_value
0 A 2022-06-28 1 1.0
1 A 2022-06-28 2 2.0
2 A 2022-06-28 1 1.0
3 A 2022-06-27 2 2.0
4 A 2022-06-27 3 3.0
5 A 2022-06-27 10 2.0
6 B 2022-06-28 2 2.0
7 B 2022-06-28 3 3.0
8 B 2022-06-28 2 2.0
9 B 2022-06-27 3 3.0
10 B 2022-06-27 4 4.0
11 B 2022-06-27 20 3.0