I have a DataFrame of students in which each student is represented by a binary vector over 6 different courses, i.e. if the student has registered for a course, a 1 is put in the corresponding position; otherwise it is 0.
import pandas as pd
import numpy as np
df_student = pd.DataFrame({'Name':['tom', 'Howard', 'Monty', 'Sean', 'mat',
'john', 'peter', 'lina', 'rory', 'joe'],
'math':[1,0,0,1,0,1,1,1,1,1],
'physics':[1,0,0,1,0,0,1,0,1,0],
'chemistry':[0,1,1,1,0,1,0,1,1,1],
'biology':[1,0,0,1,0,1,1,1,0,1],
'history':[0,0,0,0,1,1,0,1,0,1],
'geography':[0,1,1,1,0,1,0,1,0,1]})
Which looks like:
     Name  math  physics  chemistry  biology  history  geography
0     tom     1        1          0        1        0          0
1  Howard     0        0          1        0        0          1
2   Monty     0        0          1        0        0          1
3    Sean     1        1          1        1        0          1
4     mat     0        0          0        0        1          0
5    john     1        0          1        1        1          1
6   peter     1        1          0        1        0          0
7    lina     1        0          1        1        1          1
8    rory     1        1          1        0        0          0
9     joe     1        0          1        1        1          1
I want to cluster the students into groups by applying some clustering algorithm that uses cosine similarity instead of Euclidean distance.
As a result, the students will be grouped into, for example, k clusters. For these 10 students, the expected output looks like this:
cluster_0:{tom, peter}
cluster_1:{Howard, Monty}
cluster_2:{Sean}
cluster_3:{mat}
cluster_4:{john, lina, joe}
cluster_5:{rory}
CodePudding user response:
We could create a DataFrame from the output of cosine_similarity, then mask the values less than 1 (because of floating-point rounding error, we compare against a number very close to 1) and stack the remaining values. The index of the stacked Series then contains our desired clusters. To extract them, we use groupby + agg(set) + drop_duplicates. Then we create a dictionary from the clusters:
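import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Use the names as the index so that only the course columns are compared
df_student = df_student.set_index('Name')
# Pairwise cosine similarity between students, labelled by name on both axes
df = pd.DataFrame(cosine_similarity(df_student),
                  index=df_student.index, columns=df_student.index)
# Keep only the pairs whose similarity is (numerically) 1, then collect,
# for each student, the set of students sharing that similarity of 1
clusters = (df.mask(df < 1 - np.finfo(float).eps)
              .stack()
              .index.to_frame()
              .groupby(level=0)['Name'].agg(set)
              .drop_duplicates())
# Label the distinct sets cluster_0, cluster_1, ...
clusters = (clusters.set_axis([f'cluster_{i}' for i in range(len(clusters))])
                    .to_dict())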
Output:
{'cluster_0': {'Howard', 'Monty'},
'cluster_1': {'Sean'},
'cluster_2': {'joe', 'john', 'lina'},
'cluster_3': {'mat'},
'cluster_4': {'peter', 'tom'},
'cluster_5': {'rory'}}
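The same grouping can also be sketched with NumPy alone, which may be handy if the stack / to_frame chain behaves differently on your pandas version (how missing values and duplicate level names are handled is, as far as I can tell, version dependent). This is only a minimal sketch under a few assumptions: no student has an all-zero row, np.isclose with its default tolerances is an acceptable test for "similarity equal to 1", and the names courses, unit, same and assigned are mine, not part of any library.

```python
import numpy as np
import pandas as pd

df_student = pd.DataFrame({
    'Name': ['tom', 'Howard', 'Monty', 'Sean', 'mat',
             'john', 'peter', 'lina', 'rory', 'joe'],
    'math':      [1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
    'physics':   [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    'chemistry': [0, 1, 1, 1, 0, 1, 0, 1, 1, 1],
    'biology':   [1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
    'history':   [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    'geography': [0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
})

courses = df_student.set_index('Name')
X = courses.to_numpy(dtype=float)

# Scale every row to unit length, so the dot product of two rows is their
# cosine similarity. Assumption: no row is all zeros (true for this data).
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T

# Boolean matrix: True where two students have cosine similarity ~ 1
same = np.isclose(sim, 1.0)

# Walk the students in their original order and open a new cluster for
# every student that has not been assigned yet.
clusters = {}
assigned = set()
for i, name in enumerate(courses.index):
    if name in assigned:
        continue
    members = set(courses.index[same[i]])
    assigned |= members
    clusters[f'cluster_{len(clusters)}'] = members

print(clusters)
```

On the sample data this yields six groups, numbered in order of first appearance, so cluster_0 is {tom, peter} as in the question. Note that for 0/1 vectors a cosine similarity of 1 means the two rows are identical, so this (like the approach above) groups students with exactly the same course selection rather than producing a chosen number k of clusters.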