I'm looking for a generic pseudocode, numpy, pandas, or SQL solution to this problem.
Problem Statement: You have a table of unique items with a group column which can belong to 1 or more groups. Assign each row to only one group while trying to maximize the total size of all groups. BUT each group can not exceed the specified size. The total number of rows in the table will likely be greater than all group size limits combined so some rows will need to be discarded.
Example Input:
ID Groups
(4, [s1, s2])
(5, [s1, s2])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
(12, [s3])
(13, [s3])
Desired Output Example
Group 1 (size_limit=3): 5, 6, 15
Group 2 (size_limit=2): 4, 7
Group 3 (size_limit=2): 10, 8
Bad Output
Group 1 (size_limit=3): 4, 5, 6
Group 2 (size_limit=2): 7
Group 3 (size_limit=2): 10, 8
CodePudding user response:
Here’s what I’m thinking for a solution but not sure if it covers all cases
EXAMPLE INPUT 1:
Strategy 1 (size_limit=3): 4, 5, 6, 15
Strategy 2 (size_limit=2): 4, 5, 7
Strategy 3 (size_limit=2): 8, 10, 12, 13
or
ID Strategies
(4, [s1, s2])
(5, [s1, s2])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
(12, [s3])
(13, [s3])
Go through each strategy; if the strategy is above its limit, prioritize removing rows that belong to multiple strategies
until the limit is reached.
For Strategy 1 (s1)
Strategy 1 (size_limit=3): 5, 6, 15
Strategy 2 (size_limit=2): 4, 5, 7
Strategy 3 (size_limit=2): 8, 10, 12, 13
or
ID Strategies
(4, [s2])
(5, [s1, s2])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
(12, [s3])
(13, [s3])
Then remove other groups for strategy (s1)
Strategy 1 (size_limit=3): 5, 6, 15
Strategy 2 (size_limit=2): 4, 7
Strategy 3 (size_limit=2): 8, 10, 12, 13
or
ID Strategies
(4, [s2])
(5, [s1])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
(12, [s3])
(13, [s3])
For strategy (s2): skip — already at limit
For strategy (s3)
Strategy 1 (size_limit=3): 5, 6, 15
Strategy 2 (size_limit=2): 4, 7
Strategy 3 (size_limit=2): 8, 10
or
ID Strategies
(4, [s2])
(5, [s1])
(6, [s1])
(15, [s1])
(7, [s2])
(8, [s3])
(10, [s3])
EXAMPLE 2:
Strategy 1 (size_limit=3): 4, 5, 6, 7, 8, 9
Strategy 2 (size_limit=2): 4, 5, 10, 11
Strategy 3 (size_limit=2): 4, 5, 6
or
ID Strategies
(4, [s1, s2, s3])
(5, [s1, s2, s3])
(6, [s1, s3])
(7, [s1])
(8, [s1])
(9, [s1])
(10, [s2])
(11, [s2])
Strategy = Strategy 1 (s1)
Strategy 1 (size_limit=3): 7, 8, 9
Strategy 2 (size_limit=2): 4, 5, 10, 11
Strategy 3 (size_limit=2): 4, 5, 6
or
(4, [s2, s3])
(5, [s2, s3])
(6, [s3])
(7, [s1])
(8, [s1])
(9, [s1])
(10, [s2])
(11, [s2])
Strategy = Strategy 2 (s2)
Strategy 1 (size_limit=3): 7, 8, 9
Strategy 2 (size_limit=2): 10, 11
Strategy 3 (size_limit=2): 4, 5, 6
or
(4, [s3])
(5, [s3])
(6, [s3])
(7, [s1])
(8, [s1])
(9, [s1])
(10, [s2])
(11, [s2])
Strategy = Strategy 3 (s3)
Strategy 1 (size_limit=3): 7, 8, 9
Strategy 2 (size_limit=2): 10, 11
Strategy 3 (size_limit=2): 5, 6
CodePudding user response:
My idea for an approach was to place every row in every group, find the best group to make concrete, remove its best members from the other groups, remove its excess members that are over the limit, and repeat.
I thought that the best group to make concrete would be the one with the least ambiguous rows needed to reach the limit, that is, rows that can possibly belong to other groups.
In pseudocode:
While there are ambiguous rows:
Of the groups with ambiguous rows, pick the group with the fewest ambiguous rows under its limit
Sort rows in group by number of groups they belong to
Divide list into keepers and excess, as determined by limit
Remove groups from keeper rows and keeper rows from groups until all keepers are non-ambiguous.
For excess set, remove current group from row. Remove excess set from group.
Chop off any remaining excess non-ambiguous rows
In Python:
import pandas as pd
from typing import List
class Group:
    """A group (strategy) with a capacity limit and a set of candidate rows.

    Rows register themselves via ``add_row``; ``trim`` resolves the group
    down to its limit, preferring rows that have nowhere else to go.
    """
    def __init__(self, id: str, limit: int):
        # NOTE: ``id`` shadows the builtin; kept for interface compatibility.
        self.id = id
        self.limit = limit
        self.rows = set()
    def add_row(self, row):
        """Register a candidate row with this group."""
        self.rows.add(row)
    def ambiguous_count(self):
        """Number of member rows that still belong to more than one group."""
        return sum(1 for row in self.rows if row.is_ambiguous())
    def excess_count(self):
        """How many rows over the limit (negative when under capacity)."""
        return len(self.rows) - self.limit
    def ambiguous_under_limit_count(self):
        """Count ambiguous rows among the ``limit`` best keepers.

        Rows belonging to fewer groups sort first, so the keepers are the
        rows hardest to place anywhere else.
        """
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        # BUG FIX: the original tested the bound method ``k.is_ambiguous``
        # (always truthy) instead of calling it, so every keeper counted
        # as ambiguous and the trim-order heuristic was meaningless.
        return len([k for k in keepers if k.is_ambiguous()])
    def trim(self):
        """Resolve this group: keep the best ``limit`` rows, evict the rest.

        Keepers are made unambiguous (removed from their other groups);
        excess rows are removed from this group but stay available to
        whatever other groups they belong to.
        """
        rows_list = sorted(self.rows, key=lambda row: row.groups_count())
        keepers = rows_list[:self.limit]
        for row in keepers:
            if row.is_ambiguous():
                row.make_non_ambiguous(self)
        excess = rows_list[self.limit:]
        for row in excess:
            self.remove_row(row)
            row.remove_group(self)
        self.rows = set(keepers)
    def chop(self):
        # Set order is arbitrary, so *which* surplus rows get dropped here
        # is arbitrary too — this is only a final over-limit safety cut.
        self.rows = set(list(self.rows)[:self.limit])
    def remove_row(self, row):
        self.rows.remove(row)
class GroupCollection:
    """Owns all groups and rows and drives the assignment heuristic.

    ``optimize`` repeatedly resolves the group needing the fewest ambiguous
    rows to fill its quota, then chops any remaining over-limit groups.
    """
    def __init__(self, groups, data: pd.DataFrame):
        self.groups = list(groups)
        self.groups_index = {group.id: group for group in groups}
        # Each record is (ID, [group labels]); Row registers itself with
        # every group it can belong to.
        self.rows = {Row(self, *row) for row in data.to_records(index=False)}
    def has_ambiguous_rows(self):
        """True while any row still belongs to more than one group."""
        return any(row.is_ambiguous() for row in self.rows)
    def group_to_trim(self):
        """Pick the unresolved group whose quota needs the fewest ambiguous
        rows — the cheapest group to make concrete — or None if all done."""
        candidates = [group for group in self.groups if group.ambiguous_count() > 0]
        if candidates:
            return min(candidates, key=lambda g: g.ambiguous_under_limit_count())
        else:
            return None
    def chop_all(self):
        """Enforce every group's size limit after ambiguity is resolved."""
        for group in self.groups:
            group.chop()
    def optimize(self):
        """Resolve all ambiguity group-by-group, then cut to limits."""
        while self.has_ambiguous_rows():
            if (trim_group := self.group_to_trim()):
                trim_group.trim()
        self.chop_all()
    def __str__(self):
        # BUG FIX: the original concatenation was missing its ``+``
        # operators (a syntax error); rebuilt with join for clarity.
        parts = []
        for group in self.groups:
            ids = ", ".join(str(row.id) for row in group.rows)
            parts.append(f"Group {group.id} (size_limit={group.limit}) {ids}\n")
        return "".join(parts)
class Row:
    """One item from the input table; tracks which groups it may join.

    On construction the row registers itself with every group named in
    ``group_strings``; the optimizer then whittles ``belongs_to`` down
    to at most one group.
    """
    def __init__(self, collection: GroupCollection, id: int, group_strings: List[str]):
        self.id = id
        self.collection = collection
        # Labels look like 's1', 's2'; drop the leading 's' to match the
        # ids in the collection's group index.
        self.belongs_to = {collection.groups_index[label[1:]] for label in group_strings}
        for group in self.belongs_to:
            group.add_row(self)
    def is_ambiguous(self):
        """True while this row could still end up in more than one group."""
        return self.groups_count() > 1
    def groups_count(self):
        return len(self.belongs_to)
    def make_non_ambiguous(self, group):
        """Commit this row to ``group``, withdrawing it from all others."""
        for other in self.belongs_to - {group}:
            other.remove_row(self)
        self.belongs_to = {group}
    def remove_group(self, group):
        """Withdraw this row's membership in ``group`` (row side only)."""
        self.belongs_to.remove(group)
# Example 1: three groups, more rows than total capacity.
data = pd.DataFrame({
    'ID': [4, 5, 6, 15, 7, 8, 10, 12, 13],
    'Groups': [['s1', 's2'], ['s1', 's2'], ['s1'], ['s1'],
               ['s2'], ['s3'], ['s3'], ['s3'], ['s3']],
})
all_groups = [Group('1', 3), Group('2', 2), Group('3', 2)]
gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)

# Example 2: degenerate case — every row is ambiguous between the
# same two groups, so one group ends up empty.
data = pd.DataFrame({
    'ID': [1, 2],
    'Groups': [['s1', 's2'], ['s1', 's2']],
})
all_groups = [Group('1', 2), Group('2', 2)]
gc = GroupCollection(all_groups, data)
gc.optimize()
print(gc)
Output:
Group 1 (size_limit=3) 4, 15, 6
Group 2 (size_limit=2) 5, 7
Group 3 (size_limit=2) 8, 12
Group 1 (size_limit=2) 1, 2
Group 2 (size_limit=2)
This could be optimized better and may have some edge cases I haven't thought of.