I have a dataframe like this:
df = pd.DataFrame({
'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
})
And I want to create a new column in which each value represent the most frequent value in the svm_prediction column and just for the specific id_vector group. For example if for id_vector 'A' the most frequent value in svm_prediction is 1, the new column will have all values 1, same for 0, in case the frequency is equal for both values then put 2 or in each row.
The result will be something like this where column new_label
is added to df
.
result = pd.DataFrame({
'id_vector': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E', 'E', 'E'],
'svm_prediction': [0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
'new_label': [0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2]
})
I tried:
r.groupby('id_vector')['svm_prediction'].mode()
but an error is raise because groupby has no attribute mode
also:
r['average'] = r[['svm_prediction']].mode(axis=0)
gives NaN values for every row except the first one.
CodePudding user response:
We can use np.where
to match the case you want, and np.mean
which will give 0.5
if zero's == one's and more than 0.5
if one's more and less than 0.5
for zero's more.
df['new_label'] = df.groupby('id_vector').svm_prediction.transform(np.mean)
df['new_label'] = np.where(df.new_label > 0.5 , 1, np.where(df.new_label < 0.5, 0, 2))
- output
id_vector true_label svm_prediction new_label
0 A 0 0 0
1 A 0 0 0
2 A 0 1 0
3 A 0 1 0
4 A 0 0 0
5 B 0 0 2
6 B 0 1 2
7 B 0 0 2
8 B 0 1 2
9 C 0 1 1
10 C 0 0 1
11 C 0 1 1
12 C 0 1 1
13 D 1 1 0
14 D 1 0 0
15 D 1 0 0
16 D 1 0 0
17 E 1 0 2
18 E 1 1 2
19 E 1 1 2
20 E 1 0 2
CodePudding user response:
Since the output you want has the same length as the original dataframe, groupby.transform
seems more appropriate. Then you can pass a lambda into it that uses a ternary operator to assign the mode if it's a singleton and the length otherwise.
mode_f = lambda x: m[0] if len(m:=x.mode()) == 1 else len(m)
df['new_label'] = df.groupby('id_vector')['svm_prediction'].transform(mode_f)
CodePudding user response:
well i think you are looking for this.
result_final = df.groupby('id_vector')['svm_prediction'].agg(pd.Series.mode)
print(result_final)
'''
id_vector
A 0
B [0, 1] #probably both repeated same time
C 1
D 0
E [0, 1]
Name: svm_prediction, dtype: object
'''