I have a list of students with two properties, in this case gender and class. How can I create 6 groups of 3 students (and 1 of 2) with maximum class and gender diversity in the groups? Students need to be assigned so that the chances of ending up in a group with somebody of the same gender or class are minimal. I.e. I want to mix up the groups as good as possible so students with different backgrounds intermingle.
import pandas as pd
data = [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}, {'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}]
df = pd.DataFrame(data)
The best thing I could come up with was to first apply clustering and then traverse the clusters to assign students to groups, but this basically only reduces the number of properties and doesn't solve the assignment issue. Any ideas?
from sklearn.cluster import KMeans
df['gender']= df['gender'].map({'f':1, 'm':2})
df['class'] = df['class'].map({'a':1, 'b':2})
kmeans = KMeans(n_clusters=10)
df['cluster'] = kmeans.fit_predict(df[['gender', 'class']])
CodePudding user response:
The closest problem I can think of is set packing, which is NP-complete, so it's probably safe to say there's no fast algorithm for doing this.
One observation here is that all of the variables you're interested in are categorical, so we have an easy way to define the diversity of a group: the number of times two entities in the group share the same value for a given variable. To start, let's define a few things.
A student can be a struct-like object
or a dict
; let's go with the latter. A student can therefore just look like
pat = {
'name': 'Pat',
'gender': 'nonbinary',
'class': 'stupid rich'
}
You might want to use enums for the values, but let's keep it simple. Each student has two traits ("variables" in the statistical sense), and each of those can assume any of a number of string values. Let's make sure we can keep track of those traits somewhere, maybe a private global variable:
_TRAITS = ['gender', 'class']
A group can just be a list of students.
Now we can define a function diversity_score
, which accepts a group and returns an integer (higher is more diverse):
def diversity_score(group):
score = 0
for i, student_1 in enumerate(group):
for student_2 in group[i 1:]:
for trait in _TRAITS:
score -= (student_1[trait] == student_2[trait])
return score
This can probably be done more efficiently with Pandas, but you get the point: every time the same trait is shared between two students, the diversity score goes down. This has the nice property that larger groups of homogeneous students will get quadratically penalized due to the number of connections in a fully-connected graph (two students with the same gender will reduce the score by one; three students by three; four students by 6; etc.). Therefore, a group with four women and four men will be more diverse than five women and three men; more diverse still than six and two, etc.
Now we can define a greedy algorithm. We'll start by randomly assigning all students to groups of relatively equal size. We'll then begin an iterative loop. On each iteration, we'll go through every student, and try swapping them to different groups. If we find that the overall diversity score improves (i.e., swapping students A and B from groups X and Y results in the sum of the X and Y's diversity scores increasing), we make the swap.
In code, that looks like this. First, the initial assignment:
import random
def random_assignment(students, num_groups):
"""Randomly assign students to groups.
students: list of students, as described above
num_groups: number of groups to return
returns: lists of students of more-or-less equal size
"""
random.shuffle(students)
groups = [[] for _ in range(num_groups)]
for i, student in enumerate(students):
index = i % len(groups)
groups[index].append(student)
return groups
Now for the iterative step, where we go through each group and try swapping students with other groups. First, let's define the helper function maybe_swap
, which will try swapping students between groups. If it finds a way to improve the diversity score, it will perform the swap in-place and return True
; otherwise, it will return False
:
def maybe_swap(group_1, group_2):
diversity_1, diversity_2 = map(diversity_score, [group_1, group_2])
old_sum = diversity_1 diversity_2
for _ in range(len(group_1)): # do this for every student in group_1
student_1 = group_1.pop(0) # provisionally remove from group_1
for _ in range(len(group_2)):
student_2 = group_2.pop(0) # provisionally remove from g_2
group_1.append(student_2) # provisionally swap students
group_2.append(student_1)
diversity_1, diversity_2 = map(
diversity_score, [group_1, group_2])
new_sum = diversity_1 diversity_2
if new_sum > old_sum: # see what the new score is
return True # leave the swap intact and return if higher
group_2.pop() # else, remove student_1 from group_2 ...
group_2.append(group_1.pop()) # and return student_2 to group_2
group_1.append(student_1) # return student_1 to group_1
return False
With the helper functions initial_assignment
and maybe_swap
defined, we can create the main loop:
def assign_to_groups(students, num_groups):
groups = random_assignment(students, num_groups)
while True:
has_swapped = False
for i, group_1 in enumerate(groups):
for group_2 in groups[i 1:]:
has_swapped = maybe_swap(group_1, group_2) or has_swapped
if not has_swapped:
return groups
This function is guaranteed to return because the diversity score is only permitted to increase (otherwise we might run into cycles), and the diversity score is capped at 0.
There's a lot to improve here. We could cache each group's diversity score. We could find more efficient ways to calculate diversity using data frames. We could iterate over lists using indices, cutting down on the number of slices.
But on the plus side, what we have here is an easy-to-understand algorithm in pure Python. It requires no clustering, just basic set-theoretic reasoning. Hope it helps!
Edit:
Putting it all together with the data you provided:
import random
data = [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}, {'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}]
_TRAITS = ['gender', 'class']
def diversity_score(group):
score = 0
for i, student_1 in enumerate(group):
for student_2 in group[i 1:]:
for trait in _TRAITS:
score -= (student_1[trait] == student_2[trait])
return score
def random_assignment(students, num_groups):
"""Randomly assign students to groups.
students: list of students, as described above
num_groups: number of groups to return
returns: lists of students of more-or-less equal size
"""
random.shuffle(students)
groups = [[] for _ in range(num_groups)]
for i, student in enumerate(students):
index = i % len(groups)
groups[index].append(student)
return groups
def maybe_swap(group_1, group_2):
diversity_1, diversity_2 = map(diversity_score, [group_1, group_2])
old_sum = diversity_1 diversity_2
for _ in range(len(group_1)): # do this for every student in group_1
student_1 = group_1.pop(0) # provisionally remove from group_1
for _ in range(len(group_2)):
student_2 = group_2.pop(0) # provisionally remove from g_2
group_1.append(student_2) # provisionally swap students
group_2.append(student_1)
diversity_1, diversity_2 = map(
diversity_score, [group_1, group_2])
new_sum = diversity_1 diversity_2
if new_sum > old_sum: # see what the new score is
return True # leave the swap intact and return if higher
group_2.pop() # else, remove student_1 from group_2 ...
group_2.append(group_1.pop()) # and return student_2 to group_2
group_1.append(student_1) # return student_1 to group_1
return False
def assign_to_groups(students, num_groups):
groups = random_assignment(students, num_groups)
while True:
has_swapped = False
for i, group_1 in enumerate(groups):
for group_2 in groups[i 1:]:
has_swapped = maybe_swap(group_1, group_2) or has_swapped
if not has_swapped:
return groups
Now, if I call assign_to_groups(data, 5)
, I get the output
group 0: [{'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}] (diversity score: -5)
group 1: [{'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}] (diversity score: -5)
group 2: [{'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}] (diversity score: -5)
group 3: [{'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}] (diversity score: -4)
group 4: [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}] (diversity score: -5)