Cluster students into groups with cosine similarity

I have a DataFrame of students, where each student is represented by a binary vector over 6 different courses: if the student has registered for a course, the corresponding position holds 1, otherwise 0.

import pandas as pd
import numpy as np

df_student = pd.DataFrame({'Name':['tom', 'Howard', 'Monty', 'Sean', 'mat', 
                                   'john', 'peter', 'lina', 'rory', 'joe'],
                           'math':[1,0,0,1,0,1,1,1,1,1],
                           'physics':[1,0,0,1,0,0,1,0,1,0], 
                           'chemistry':[0,1,1,1,0,1,0,1,1,1],
                           'biology':[1,0,0,1,0,1,1,1,0,1],
                           'history':[0,0,0,0,1,1,0,1,0,1],
                           'geography':[0,1,1,1,0,1,0,1,0,1]})

Which looks like:

     Name  math  physics  chemistry  biology  history  geography
0     tom     1        1          0        1        0          0
1  Howard     0        0          1        0        0          1
2   Monty     0        0          1        0        0          1
3    Sean     1        1          1        1        0          1
4     mat     0        0          0        0        1          0
5    john     1        0          1        1        1          1
6   peter     1        1          0        1        0          0
7    lina     1        0          1        1        1          1
8    rory     1        1          1        0        0          0
9     joe     1        0          1        1        1          1

I want to cluster the students into groups by applying a clustering algorithm that uses cosine similarity instead of Euclidean distance.

The students should end up grouped into k clusters. For example, with these 10 students the expected output looks like this:

cluster_0:{tom, peter}
cluster_1:{Howard, Monty}
cluster_2:{Sean}
cluster_3:{mat}
cluster_4:{john, lina, joe}
cluster_5:{rory}

CodePudding user response:

We can build a DataFrame from the output of cosine_similarity, mask every value below 1 (comparing against a number very close to 1 to allow for floating-point rounding error), and stack the remaining values. The index of the stacked Series then holds every pair of students whose cosine similarity is 1, which for binary vectors means identical course selections. To turn the pairs into clusters, we use groupby + agg to collect each student's matches into a set, then drop_duplicates. Finally, we create a dictionary from the clusters:

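import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Index by name so that only the six course columns are compared.
df_student = df_student.set_index('Name')
# Pairwise cosine similarity. The column axis is renamed ('Match' is an
# arbitrary label) so the two stacked index levels do not share a name.
df = pd.DataFrame(cosine_similarity(df_student),
                  index=df_student.index,
                  columns=df_student.index.rename('Match'))
# Keep only pairs with similarity 1 (up to rounding error) and collect each
# student's matches. frozenset is hashable, so drop_duplicates can compare them.
clusters = (df.mask(df < 1 - np.finfo(float).eps)
            .stack()
            .dropna()
            .index.to_frame()
            .groupby(level=0)['Match'].agg(frozenset)
            .drop_duplicates())
# Label the clusters and convert them to a plain dict of sets.
clusters = {f'cluster_{i}': set(members) for i, members in enumerate(clusters)}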

Output:

{'cluster_0': {'Howard', 'Monty'},
 'cluster_1': {'Sean'},
 'cluster_2': {'joe', 'john', 'lina'},
 'cluster_3': {'mat'},
 'cluster_4': {'peter', 'tom'},
 'cluster_5': {'rory'}}
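
The approach above only groups students whose course vectors are identical. If students with similar but not identical selections should also share a cluster, one option is hierarchical clustering on cosine distance (1 minus cosine similarity). The following is a minimal sketch, not part of the original answer: it assumes SciPy is installed and that every student has at least one course, since cosine distance is undefined for an all-zero vector. The helper name cosine_clusters and its max_distance parameter are made up for illustration.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Same data as in the question, with the student names as the index.
courses = pd.DataFrame(
    {'math':      [1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
     'physics':   [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
     'chemistry': [0, 1, 1, 1, 0, 1, 0, 1, 1, 1],
     'biology':   [1, 0, 0, 1, 0, 1, 1, 1, 0, 1],
     'history':   [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
     'geography': [0, 1, 1, 1, 0, 1, 0, 1, 0, 1]},
    index=['tom', 'Howard', 'Monty', 'Sean', 'mat',
           'john', 'peter', 'lina', 'rory', 'joe'])


def cosine_clusters(frame, max_distance):
    """Hypothetical helper: group rows by average-linkage cosine distance.

    Rows end up in the same group when the clusters they belong to were
    merged at a cosine distance of at most max_distance.
    Assumes no row is all zeros.
    """
    # linkage computes the pairwise cosine distances itself.
    tree = linkage(frame.to_numpy(dtype=float),
                   method='average', metric='cosine')
    # Cut the tree at the requested distance to get one label per row.
    labels = fcluster(tree, t=max_distance, criterion='distance')
    groups = {}
    for name, label in zip(frame.index, labels):
        groups.setdefault(label, set()).add(name)
    return list(groups.values())


# A tiny threshold only merges identical vectors.
exact = cosine_clusters(courses, max_distance=1e-9)
# A looser threshold also merges students with overlapping selections.
loose = cosine_clusters(courses, max_distance=0.25)
print(exact)
print(loose)
```

With max_distance=1e-9 this yields the same six groups as the answer above. With max_distance=0.25, Sean joins john, lina and joe, because his cosine similarity to each of them is 4/5 = 0.8 (a distance of 0.2), while the remaining groups stay apart. The threshold and the linkage method are choices to tune for the data, not fixed recommendations.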