How to change values with only one occurrence? (pandas)-CodePudding

Say I have a sample dataframe like this, with val being a binary value (between 1 and 2 in this instance). I would like to eliminate outliers in val, changing them to be the same as the majority value.

df = pandas.DataFrame({'name':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'], 'val':[1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2]})

   name  val
0     A    1
1     A    2
2     A    2
3     A    2
4     B    2
5     B    1
6     B    1
7     B    1
8     C    1
9     C    1
10    C    2
11    C    2

I would like the values at indexes 0 and 4 to be corrected (to 2 and 1 respectively, here), as there is only one occurrence in each group, but C to be unaltered.

I think I could write a transform statement, but not sure how to go about it.

CodePudding user response：

As you wrote you only have two possible values, you can compare the count of each value:

def fix_outliers(sr):
    cnt = sr.value_counts()
    return sr if cnt.iloc[0] == cnt.iloc[1] else [cnt.index[0]]*len(sr)

out = df.groupby('name')['val'].transform(fix_outliers)

Output:

>>> out
0     2
1     2
2     2
3     2
4     1
5     1
6     1
7     1
8     1
9     1
10    2
11    2
Name: val, dtype: int64

CodePudding user response：

If you want to keep the value that occurs most times you can use mode to find this values, than you can check if the count of mode is equal to 1. In case it is not equal to 1 that means that has two or more values happen in the same frequency.

for name in df["name"].unique(): #find distinct names in df
    if(df[(df["name"] == name)].mode()["val"].count() == 1): #check if mode is sized 1
        most_common_value = df[(df["name"] == name)].mode()["val"][0] # find the mode
        df.loc[df["name"] == name , "val"] = most_common_value # modify df to val be the mode

Output: