Create groups with maximum diversity of group members-CodePudding

I have a list of students with two properties, in this case gender and class. How can I create 6 groups of 3 students (and 1 of 2) with maximum class and gender diversity in the groups? Students need to be assigned so that the chances of ending up in a group with somebody of the same gender or class are minimal. I.e. I want to mix up the groups as good as possible so students with different backgrounds intermingle.

import pandas as pd 

data = [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}, {'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}]
df = pd.DataFrame(data)

The best thing I could come up with was to first apply clustering and then traverse the clusters to assign students to groups, but this basically only reduces the number of properties and doesn't solve the assignment issue. Any ideas?

from sklearn.cluster import KMeans

df['gender']= df['gender'].map({'f':1, 'm':2})
df['class'] = df['class'].map({'a':1, 'b':2})
kmeans = KMeans(n_clusters=10)

df['cluster'] = kmeans.fit_predict(df[['gender', 'class']])

CodePudding user response：

The closest problem I can think of is set packing, which is NP-complete, so it's probably safe to say there's no fast algorithm for doing this.

One observation here is that all of the variables you're interested in are categorical, so we have an easy way to define the diversity of a group: the number of times two entities in the group share the same value for a given variable. To start, let's define a few things.

A student can be a struct-like object or a dict; let's go with the latter. A student can therefore just look like

pat = {
  'name': 'Pat',
  'gender': 'nonbinary',
  'class': 'stupid rich'
}

You might want to use enums for the values, but let's keep it simple. Each student has two traits ("variables" in the statistical sense), and each of those can assume any of a number of string values. Let's make sure we can keep track of those traits somewhere, maybe a private global variable:

_TRAITS = ['gender', 'class']

A group can just be a list of students.

Now we can define a function diversity_score, which accepts a group and returns an integer (higher is more diverse):

def diversity_score(group):
    score = 0
    for i, student_1 in enumerate(group):
        for student_2 in group[i   1:]:
            for trait in _TRAITS:
                score -= (student_1[trait] == student_2[trait])
    return score

This can probably be done more efficiently with Pandas, but you get the point: every time the same trait is shared between two students, the diversity score goes down. This has the nice property that larger groups of homogeneous students will get quadratically penalized due to the number of connections in a fully-connected graph (two students with the same gender will reduce the score by one; three students by three; four students by 6; etc.). Therefore, a group with four women and four men will be more diverse than five women and three men; more diverse still than six and two, etc.

Now we can define a greedy algorithm. We'll start by randomly assigning all students to groups of relatively equal size. We'll then begin an iterative loop. On each iteration, we'll go through every student, and try swapping them to different groups. If we find that the overall diversity score improves (i.e., swapping students A and B from groups X and Y results in the sum of the X and Y's diversity scores increasing), we make the swap.

In code, that looks like this. First, the initial assignment:

import random

def random_assignment(students, num_groups):
    """Randomly assign students to groups.

    students: list of students, as described above
    num_groups: number of groups to return
    returns: lists of students of more-or-less equal size
    """
    random.shuffle(students)
    groups = [[] for _ in range(num_groups)]
    for i, student in enumerate(students):
        index = i % len(groups)
        groups[index].append(student)
    return groups

Now for the iterative step, where we go through each group and try swapping students with other groups. First, let's define the helper function maybe_swap, which will try swapping students between groups. If it finds a way to improve the diversity score, it will perform the swap in-place and return True; otherwise, it will return False:

def maybe_swap(group_1, group_2):
    diversity_1, diversity_2 = map(diversity_score, [group_1, group_2])
    old_sum = diversity_1   diversity_2
    for _ in range(len(group_1)):  # do this for every student in group_1
        student_1 = group_1.pop(0)  # provisionally remove from group_1
        for _ in range(len(group_2)):
            student_2 = group_2.pop(0)  # provisionally remove from g_2
            group_1.append(student_2)  # provisionally swap students
            group_2.append(student_1)
            diversity_1, diversity_2 = map(
                diversity_score, [group_1, group_2])
            new_sum = diversity_1   diversity_2
            if new_sum > old_sum:  # see what the new score is
                return True  # leave the swap intact and return if higher
            group_2.pop()  # else, remove student_1 from group_2 ...
            group_2.append(group_1.pop())  # and return student_2 to group_2
        group_1.append(student_1)  # return student_1 to group_1
    return False

With the helper functions initial_assignment and maybe_swap defined, we can create the main loop:

def assign_to_groups(students, num_groups):
    groups = random_assignment(students, num_groups)
    while True:
        has_swapped = False
        for i, group_1 in enumerate(groups):
            for group_2 in groups[i   1:]:
                has_swapped = maybe_swap(group_1, group_2) or has_swapped
        if not has_swapped:
            return groups

This function is guaranteed to return because the diversity score is only permitted to increase (otherwise we might run into cycles), and the diversity score is capped at 0.

There's a lot to improve here. We could cache each group's diversity score. We could find more efficient ways to calculate diversity using data frames. We could iterate over lists using indices, cutting down on the number of slices.

But on the plus side, what we have here is an easy-to-understand algorithm in pure Python. It requires no clustering, just basic set-theoretic reasoning. Hope it helps!

Edit:

Putting it all together with the data you provided:

import random


data = [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}, {'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}]


_TRAITS = ['gender', 'class']


def diversity_score(group):
    score = 0
    for i, student_1 in enumerate(group):
        for student_2 in group[i   1:]:
            for trait in _TRAITS:
                score -= (student_1[trait] == student_2[trait])
    return score


def random_assignment(students, num_groups):
    """Randomly assign students to groups.

    students: list of students, as described above
    num_groups: number of groups to return
    returns: lists of students of more-or-less equal size
    """
    random.shuffle(students)
    groups = [[] for _ in range(num_groups)]
    for i, student in enumerate(students):
        index = i % len(groups)
        groups[index].append(student)
    return groups


def maybe_swap(group_1, group_2):
    diversity_1, diversity_2 = map(diversity_score, [group_1, group_2])
    old_sum = diversity_1   diversity_2
    for _ in range(len(group_1)):  # do this for every student in group_1
        student_1 = group_1.pop(0)  # provisionally remove from group_1
        for _ in range(len(group_2)):
            student_2 = group_2.pop(0)  # provisionally remove from g_2
            group_1.append(student_2)  # provisionally swap students
            group_2.append(student_1)
            diversity_1, diversity_2 = map(
                diversity_score, [group_1, group_2])
            new_sum = diversity_1   diversity_2
            if new_sum > old_sum:  # see what the new score is
                return True  # leave the swap intact and return if higher
            group_2.pop()  # else, remove student_1 from group_2 ...
            group_2.append(group_1.pop())  # and return student_2 to group_2
        group_1.append(student_1)  # return student_1 to group_1
    return False


def assign_to_groups(students, num_groups):
    groups = random_assignment(students, num_groups)
    while True:
        has_swapped = False
        for i, group_1 in enumerate(groups):
            for group_2 in groups[i   1:]:
                has_swapped = maybe_swap(group_1, group_2) or has_swapped
        if not has_swapped:
            return groups

Now, if I call assign_to_groups(data, 5), I get the output

group 0: [{'name': 'Manuel', 'gender': 'm', 'class': 'b'}, {'name': 'Jared', 'gender': 'm', 'class': 'a'}, {'name': 'Michael', 'gender': 'm', 'class': 'b'}, {'name': 'Eunice', 'gender': 'f', 'class': 'a'}] (diversity score: -5)
group 1: [{'name': 'Gary', 'gender': 'm', 'class': 'a'}, {'name': 'Elizabeth', 'gender': 'f', 'class': 'a'}, {'name': 'Willie', 'gender': 'm', 'class': 'b'}, {'name': 'Jose', 'gender': 'm', 'class': 'b'}] (diversity score: -5)
group 2: [{'name': 'Jessie', 'gender': 'm', 'class': 'b'}, {'name': 'Michael', 'gender': 'm', 'class': 'a'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}, {'name': 'Timothy', 'gender': 'm', 'class': 'b'}] (diversity score: -5)
group 3: [{'name': 'Matthew', 'gender': 'm', 'class': 'a'}, {'name': 'Susan', 'gender': 'f', 'class': 'b'}, {'name': 'Amado', 'gender': 'm', 'class': 'b'}, {'name': 'Mary', 'gender': 'f', 'class': 'a'}] (diversity score: -4)
group 4: [{'name': 'Katina', 'gender': 'f', 'class': 'a'}, {'name': 'Edward', 'gender': 'm', 'class': 'a'}, {'name': 'Charlie', 'gender': 'm', 'class': 'a'}, {'name': 'Melanie', 'gender': 'f', 'class': 'b'}] (diversity score: -5)