Home > Software design >  Select mode of each group and assign back to the original dataframe on a condition
Select mode of each group and assign back to the original dataframe on a condition

Time:01-31

I have a dataframe like this:

df = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'], 
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
})

And I want to create a new column in which each value represent the most frequent value in the svm_prediction column and just for the specific id_vector group. For example if for id_vector 'A' the most frequent value in svm_prediction is 1, the new column will have all values 1, same for 0, in case the frequency is equal for both values then put 2 or in each row.

The result will be something like this where column new_label is added to df.

result = pd.DataFrame({
    'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'], 
    'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0], 
    'new_label': [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
})

I tried:

r.groupby('id_vector')['svm_prediction'].mode()

but an error is raise because groupby has no attribute mode

also:

r['average'] = r[['svm_prediction']].mode(axis=0)

gives NaN values for every row except the first one.

CodePudding user response:

We can use np.where to match the case you want, and np.mean which will give 0.5 if zero's == one's and more than 0.5 if one's more and less than 0.5 for zero's more.

df['new_label'] = df.groupby('id_vector').svm_prediction.transform(np.mean)
df['new_label'] = np.where(df.new_label > 0.5 , 1, np.where(df.new_label < 0.5, 0, 2))

  • output
    id_vector   true_label  svm_prediction  new_label
0   A   0   0   0
1   A   0   0   0
2   A   0   1   0
3   A   0   1   0
4   A   0   0   0
5   B   0   0   2
6   B   0   1   2
7   B   0   0   2
8   B   0   1   2
9   C   0   1   1
10  C   0   0   1
11  C   0   1   1
12  C   0   1   1
13  D   1   1   0
14  D   1   0   0
15  D   1   0   0
16  D   1   0   0
17  E   1   0   2
18  E   1   1   2
19  E   1   1   2
20  E   1   0   2

CodePudding user response:

Since the output you want has the same length as the original dataframe, groupby.transform seems more appropriate. Then you can pass a lambda into it that uses a ternary operator to assign the mode if it's a singleton and the length otherwise.

mode_f = lambda x: m[0] if len(m:=x.mode()) == 1 else len(m)
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_f)

res

CodePudding user response:

well i think you are looking for this.

result_final = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
print(result_final)
'''
id_vector
A         0
B    [0, 1] #probably both repeated same time
C         1
D         0
E    [0, 1]
Name: svm_prediction, dtype: object
'''
  • Related