I have a set of students, each with a list of skills they want to learn, and a set of teachers, each with a list of skills they are ready to teach.
Based on this information I have the two tables below, one for the students and one for the teachers. A '1' means the student wants to learn that skill (or the teacher is willing to teach it); a '0' means the opposite.
| Students | Skill 1 | Skill 2 | Skill 3 | Skill 4 | Skill 5 |
|----------|---------|---------|---------|---------|---------|
| A | 1 | 0 | 0 | 1 | 0 |
| B | 1 | 1 | 0 | 0 | 1 |
| C | 0 | 0 | 1 | 1 | 0 |
| D | 1 | 1 | 0 | 1 | 1 |
| E | 0 | 1 | 1 | 0 | 1 |

| Teachers | Skill 1 | Skill 2 | Skill 3 | Skill 4 | Skill 5 |
|----------|---------|---------|---------|---------|---------|
| F | 1 | 1 | 1 | 1 | 1 |
| G | 0 | 1 | 0 | 0 | 0 |
| H | 0 | 0 | 1 | 1 | 1 |
| I | 1 | 1 | 0 | 0 | 0 |
| J | 0 | 0 | 1 | 0 | 1 |
I am trying to match the teachers with the appropriate students, and one suggestion I have seen is to use the Jaccard index. However, I am not sure whether the Jaccard index works correctly on binary data.
I tried it on a small dataset, as shown below, but I am not getting the correct result.
import numpy as np

a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]

# define Jaccard Similarity function
def jaccard(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

# find Jaccard Similarity between the two sets
jaccard(a, b)
0.16666 is the output even though the binary lists are exactly the same.
Any suggestions on how to correctly use the Jaccard Index in this case or any other way to match the teachers to the students? Thanks!
CodePudding user response:
If I understand correctly, you want to compute the maximum skill overlap using the Jaccard index and assign the "best" teacher to each student.
The first step is to compute a matrix of Jaccard indices:
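import pandas as pd
from itertools import product

# df1 and df2 are the Students and Teachers tables from the question

# one frozenset of skill names per student
S = (df1.melt(id_vars='Students')
        .query('value==1')
        .groupby('Students')['variable']
        .agg(frozenset)
     )

# one frozenset of skill names per teacher
T = (df2.melt(id_vars='Teachers')
        .query('value==1')
        .groupby('Teachers')['variable']
        .agg(frozenset)
     )

def jaccard(s1, s2):
    # size of the intersection divided by size of the union
    return len(s1&s2)/len(s1|s2)

# student x teacher matrix of Jaccard indices
df = (pd
   .Series({(s,t): jaccard(S[s], T[t]) for s,t in product(S.index, T.index)})
   .unstack()
   .rename_axis(index='student', columns='teacher')
)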
# df
teacher    F         G         H         I         J
student
A        0.4  0.000000  0.250000  0.333333  0.000000
B        0.6  0.333333  0.200000  0.666667  0.250000
C        0.4  0.000000  0.666667  0.000000  0.333333
D        0.8  0.250000  0.400000  0.500000  0.200000
E        0.6  0.333333  0.500000  0.250000  0.666667
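The snippet above takes `df1` and `df2` as given. As a minimal sketch, assuming the two tables from the question are typed in by hand, they could be built as shown here, and the same matrix can be cross-checked directly from the 0/1 values. The names `students`, `teachers` and `jaccard_matrix` are only illustrative, and the division assumes every student and teacher has at least one skill.

```python
import numpy as np
import pandas as pd

skills = [f"Skill {i}" for i in range(1, 6)]

# Student and teacher tables from the question, as 0/1 matrices.
students = pd.DataFrame(
    [[1, 0, 0, 1, 0],
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0],
     [1, 1, 0, 1, 1],
     [0, 1, 1, 0, 1]],
    index=list("ABCDE"), columns=skills,
)
teachers = pd.DataFrame(
    [[1, 1, 1, 1, 1],
     [0, 1, 0, 0, 0],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 0, 0],
     [0, 0, 1, 0, 1]],
    index=list("FGHIJ"), columns=skills,
)

# Same layout the melt-based code expects: a name column plus skill columns.
df1 = students.rename_axis("Students").reset_index()
df2 = teachers.rename_axis("Teachers").reset_index()

# intersection[i, j] = number of skills student i wants and teacher j offers.
intersection = students.to_numpy() @ teachers.to_numpy().T

# |A or B| = |A| + |B| - |A and B|
sizes_s = students.sum(axis=1).to_numpy()[:, None]
sizes_t = teachers.sum(axis=1).to_numpy()[None, :]
union = sizes_s + sizes_t - intersection

# Assumes no all-zero rows, so the union is never 0.
jaccard_matrix = pd.DataFrame(
    intersection / union, index=students.index, columns=teachers.index
).rename_axis(index="student", columns="teacher")

print(jaccard_matrix)
```

The matrix product counts shared skills for every student/teacher pair at once, which avoids the Python-level loop when the tables get large.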
Then we can solve the assignment problem using scipy.optimize.linear_sum_assignment:
from scipy.optimize import linear_sum_assignment

# x indexes the rows (students), y indexes the columns (teachers)
x, y = linear_sum_assignment(df, maximize=True)
out = pd.DataFrame({'student': df.index[x], 'teacher': df.columns[y]})

# out
  student teacher
0       A       G
1       B       I
2       C       H
3       D       F
4       E       J
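As a self-contained sketch of the assignment step: `linear_sum_assignment` returns the row indices first and the column indices second, so with students as rows the first array indexes students and the second indexes teachers. The score matrix below is hard-coded from the Jaccard values for the question's tables, and the names `scores`, `pairs` and `total` are only illustrative. More than one assignment can tie for the best total on this data, so the exact pairs may differ between solvers while the total stays the same.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

students = ["A", "B", "C", "D", "E"]
teachers = ["F", "G", "H", "I", "J"]

# Jaccard indices: rows are students, columns are teachers.
scores = np.array([
    [2/5, 0,   1/4, 1/3, 0],
    [3/5, 1/3, 1/5, 2/3, 1/4],
    [2/5, 0,   2/3, 0,   1/3],
    [4/5, 1/4, 2/5, 1/2, 1/5],
    [3/5, 1/3, 1/2, 1/4, 2/3],
])

# One teacher per student, maximizing the summed similarity.
row_ind, col_ind = linear_sum_assignment(scores, maximize=True)

pairs = [(students[r], teachers[c]) for r, c in zip(row_ind, col_ind)]
total = scores[row_ind, col_ind].sum()

print(pairs)
print(total)
```

Because the matching is one-to-one, a student can end up with a teacher that is not their individual best match if that improves the overall total.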
Alternatively, if you just want the best teacher for each student, even if this means some teachers get no students and others get several, use idxmax:
df.idxmax(axis=1)

student
A    F
B    I
C    H
D    F
E    J
dtype: object
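As for the 0.16666 in the question: `set(list1)` collapses a 0/1 list to `{0, 1}`, so the "intersection" counts the two distinct values rather than the shared skills, while `len(list1) + len(list2)` counts all 14 positions. That gives 2 / (14 - 2) = 0.1666. A minimal sketch of a Jaccard index that works on the binary vectors directly could look like this; `jaccard_binary` is a hypothetical helper name, and returning 0.0 when both vectors are all zeros is an assumption, not a standard.

```python
import numpy as np

def jaccard_binary(u, v):
    """Jaccard index of two equal-length 0/1 vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    # Positions where at least one vector has a 1.
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0  # assumption: two empty skill sets count as similarity 0
    # Positions where both vectors have a 1, divided by the union size.
    return float(np.logical_and(u, v).sum() / union)

a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]

print(jaccard_binary(a, b))  # identical vectors give 1.0
```

Only positions holding a 1 contribute, so shared zeros (skills neither side cares about) do not inflate the score.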