Select mode of each group and assign back to the original dataframe on a condition

I have a dataframe like this:

df = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'], 
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
})

I want to create a new column in which each value is the most frequent value of svm_prediction within that row's id_vector group. For example, if the most frequent value for id_vector 'A' is 1, every row of that group gets 1 in the new column, and likewise for 0. If both values are equally frequent, every row of the group gets 2.

The result will be something like this where column new_label is added to df.

result = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'], 
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0], 
    'new_label': [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
})

I tried:

r.groupby('id_vector')['svm_prediction'].mode()

but an error is raised because the groupby object has no attribute mode

also:

r['average'] = r[['svm_prediction']].mode(axis=0)

gives NaN values for every row except the first one.
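The NaN values are most likely a result of index alignment: mode() returns one row per modal value (indexed from 0), not one row per input row. The sketch below illustrates this on the example data; the names overall_mode and aligned are only for illustration, and df is used in place of r.

```python
import pandas as pd

df = pd.DataFrame({
    'id_vector': list('AAAAABBBBCCCCDDDDEEEE'),
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# mode() returns one row per modal value, indexed from 0,
# so here it is a single-row frame rather than a 21-row one.
overall_mode = df[['svm_prediction']].mode(axis=0)
print(overall_mode)

# Aligning that result on the original index leaves every other row empty.
aligned = overall_mode['svm_prediction'].reindex(df.index)
print(aligned.notna().sum())

# A scalar is broadcast to all rows instead. Note this is the mode of the
# whole column, not the per-group mode the question asks for.
df['overall_mode'] = df['svm_prediction'].mode().iloc[0]
```

This only explains the symptom; a per-group result still needs a groupby.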

CodePudding user response：

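We can use np.where for the condition you describe, together with the group mean. Because the predictions are only 0 or 1, the mean of a group is exactly 0.5 when zeros and ones are equally frequent, above 0.5 when ones are more frequent, and below 0.5 when zeros are more frequent.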

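import numpy as np
# Group mean of the 0/1 predictions, broadcast to every row; exactly 0.5 means a tie.
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform('mean')
df['new_label'] = np.where(df.new_label > 0.5, 1, np.where(df.new_label < 0.5, 0, 2))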

output

   id_vector  true_label  svm_prediction  new_label
0          A           0               0          0
1          A           0               0          0
2          A           0               1          0
3          A           0               1          0
4          A           0               0          0
5          B           0               0          2
6          B           0               1          2
7          B           0               0          2
8          B           0               1          2
9          C           0               1          1
10         C           0               0          1
11         C           0               1          1
12         C           0               1          1
13         D           1               1          0
14         D           1               0          0
15         D           1               0          0
16         D           1               0          0
17         E           1               0          2
18         E           1               1          2
19         E           1               1          2
20         E           1               0          2
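As a quick check of the thresholds, the sketch below computes the mean once per group on the question's data and compares the resulting labels with the expected new_label column. It uses np.select instead of nested np.where purely as a readability choice; group_mean, row_mean and expected are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id_vector': list('AAAAABBBBCCCCDDDDEEEE'),
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})
# new_label column from the question's expected result
expected = [0] * 5 + [2] * 4 + [1] * 4 + [0] * 4 + [2] * 4

# One mean per group: the share of ones among that group's predictions.
group_mean = df.groupby('id_vector')['svm_prediction'].mean()
print(group_mean)

# Look up each row's group mean, then apply the three-way rule.
row_mean = df['id_vector'].map(group_mean)
df['new_label'] = np.select([row_mean > 0.5, row_mean < 0.5], [1, 0], default=2)
print(df['new_label'].tolist() == expected)
```

The mean trick only applies while the labels are strictly 0 and 1; with more label values the mean no longer identifies the most frequent one.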

CodePudding user response：

Since the output you want has the same length as the original dataframe, groupby.transform seems more appropriate. You can pass it a lambda that uses a conditional expression to return the mode when the group has a single mode, and the number of modes otherwise (which is 2 when 0 and 1 are tied).

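# Return the group's mode if it is unique, otherwise the number of tied modes.
# The := assignment expression requires Python 3.8 or later.
mode_f = lambda x: m[0] if len(m := x.mode()) == 1 else len(m)
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_f)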
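Returning the number of modes gives 2 only because the labels here are binary. If there could be more than two label values, a variant with an explicit tie label may be safer. The following is a minimal sketch under that assumption; mode_or_tie and tie_label are hypothetical names, not part of pandas.

```python
import pandas as pd

def mode_or_tie(values: pd.Series, tie_label: int = 2) -> int:
    """Return the unique mode of values, or tie_label if several values tie."""
    modes = values.mode()  # sorted Series holding every most-frequent value
    return int(modes.iloc[0]) if len(modes) == 1 else tie_label

df = pd.DataFrame({
    'id_vector': list('AAAAABBBBCCCCDDDDEEEE'),
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# transform broadcasts the scalar returned for each group to all of its rows.
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_or_tie)
print(df)

# With three tied values the length-based fallback would yield 3, not 2.
print(mode_or_tie(pd.Series([0, 1, 2])))
```

A named function also avoids the assignment expression, so it runs on Python versions older than 3.8.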

CodePudding user response：

I think this is what you are looking for: aggregate each group with pd.Series.mode. Groups with a tie return both values.

result_final = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
print(result_final)
'''
id_vector
A         0
B    [0, 1]  # tie: 0 and 1 occur equally often
C         1
D         0
E    [0, 1]
Name: svm_prediction, dtype: object
'''
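The aggregation above yields one entry per group, so it still has to be turned into a label and mapped back onto the rows. One possible way is sketched below; it collects the modes as plain lists rather than arrays to keep the length check simple, and per_group_modes and group_label are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'id_vector': list('AAAAABBBBCCCCDDDDEEEE'),
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# One list of modes per group; a tie produces a list with more than one value.
per_group_modes = df.groupby('id_vector')['svm_prediction'].agg(lambda s: s.mode().tolist())

# Collapse each list to a label: the single mode, or 2 for a tie.
group_label = per_group_modes.map(lambda modes: modes[0] if len(modes) == 1 else 2)

# Look up each row's group label through its id_vector.
df['new_label'] = df['id_vector'].map(group_label)
print(df)
```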