Home > Software engineering >  Pandas finding and replacing outliers based on a group of two columns
Pandas finding and replacing outliers based on a group of two columns

Time:06-29

I'm having a bit of trouble finding outliers in a df based on groups and dates.

For exampe I have a df like and I would like to find and replace the outlier values (10 for the group A on date 2022-06-27 and 20 for the group B on 2022-06-27) with the median of the respective group (3 for the first outliers and 4 for the second).

However I'm having some trouble filtering the data and isolating the outliers and replacing them.

 index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['vale'] = r

print (df)


     group        date  val
0      A  2022-06-28    1
1      A  2022-06-28    2
2      A  2022-06-28    1
3      A  2022-06-27    2
4      A  2022-06-27    3
5      A  2022-06-27   10
6      B  2022-06-28    2
7      B  2022-06-28    3
8      B  2022-06-28    2
9      B  2022-06-27    3
10     B  2022-06-27    4
11     B  2022-06-27   20

Thanks for the help!

CodePudding user response:

First you can identify outliers. This code identifies any values that are greater than one standard deviation away from the mean.

outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

Then you can determine the median of each group:

medians = df.groupby('group')['value'].median()

Finally, locate the outliers and replace with the medians:

df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

All together it looks like:

import pandas as pd
index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['value'] = r
outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
medians = df.groupby('group')['value'].median()
df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values

Output:

   group        date  value
0      A  2022-06-28      1
1      A  2022-06-28      2
2      A  2022-06-28      1
3      A  2022-06-27      2
4      A  2022-06-27      3
5      A  2022-06-27      2
6      B  2022-06-28      2
7      B  2022-06-28      3
8      B  2022-06-28      2
9      B  2022-06-27      3
10     B  2022-06-27      4
11     B  2022-06-27      3

CodePudding user response:

You can use a combination of .groupby/transform to obtain the medians for each grouping, and then mask your original data against the outliers, filling with those medians.

medians = df.groupby('group')['value'].transform('median')
df['new_value'] = df['value'].mask(lambda s: (s - s.mean()).abs() > s.std(), medians)

print(df)
   group        date  value  new_value
0      A  2022-06-28      1        1.0
1      A  2022-06-28      2        2.0
2      A  2022-06-28      1        1.0
3      A  2022-06-27      2        2.0
4      A  2022-06-27      3        3.0
5      A  2022-06-27     10        2.0
6      B  2022-06-28      2        2.0
7      B  2022-06-28      3        3.0
8      B  2022-06-28      2        2.0
9      B  2022-06-27      3        3.0
10     B  2022-06-27      4        4.0
11     B  2022-06-27     20        3.0
  • Related