Say I have a sample dataframe like this, with val being a binary value (between 1 and 2 in this instance). I would like to eliminate outliers in val, changing them to be the same as the majority value.
df = pandas.DataFrame({'name':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'val':[1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2]})
name val
0 A 1
1 A 2
2 A 2
3 A 2
4 B 2
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2
I would like the values at indexes 0 and 4 to be corrected (to 2 and 1 respectively, here), as there is only one occurrence in each group, but C to be unaltered.
I think I could write a transform statement, but not sure how to go about it.
CodePudding user response:
As you wrote you only have two possible values, you can compare the count of each value:
def fix_outliers(sr):
cnt = sr.value_counts()
return sr if cnt.iloc[0] == cnt.iloc[1] else [cnt.index[0]]*len(sr)
out = df.groupby('name')['val'].transform(fix_outliers)
Output:
>>> out
0 2
1 2
2 2
3 2
4 1
5 1
6 1
7 1
8 1
9 1
10 2
11 2
Name: val, dtype: int64
CodePudding user response:
If you want to keep the value that occurs most times you can use mode to find this values, than you can check if the count
of mode
is equal to 1. In case it is not equal to 1 that means that has two or more values happen in the same frequency.
for name in df["name"].unique(): #find distinct names in df
if(df[(df["name"] == name)].mode()["val"].count() == 1): #check if mode is sized 1
most_common_value = df[(df["name"] == name)].mode()["val"][0] # find the mode
df.loc[df["name"] == name , "val"] = most_common_value # modify df to val be the mode
Output:
name val
0 A 2
1 A 2
2 A 2
3 A 2
4 B 1
5 B 1
6 B 1
7 B 1
8 C 1
9 C 1
10 C 2
11 C 2