How to enumerate combinations filtering repeats


I have a list of possible choices: [[1], [2, 4], [4], [5, 6, 2], [5, 3]].

I want to list all combinations, taking at most one element from each sublist, without repeating any element.

So [1, 2, 4, 5, 3] is a valid option, but [1, 4, 4, 5, 3] is not. I also allow not making a choice from any sublist, so [1, 4, None, 5, 3] is valid, as are [1, None, None, None, None] and [None, None, None, None, None].

I can't simply enumerate all combinations then filter out the ones I don't want, since for a large list of possible choices, it would quickly become computationally infeasible (I'm looking at 25^25 maximum combinations in my project).

edit: I would also apply some additional criteria to the results, such as filtering to have no more than a threshold of None choices, or sorting the resultant list of combinations in order of combinations with fewest None choices.

edit, with details of the real-life case: I'd like to apply it to a list of 25 sublists, each of which can have 1-25 elements. Realistically, each sublist will have at most 15 elements, with 2-4 on average.

So the easy solution of list(itertools.product(*choices)) then filtering is out.

I may also wish to add other filter conditions to the list of combinations, so ideally I can filter these upfront.

I've tried building a tree recursively, where e.g. the root node has the full list of choices, a child node makes the first choice [1] and gets an updated list of choices in which 1 has been removed from every sublist in choices[1:].

Struggling to implement the recursion though.
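The step I have in mind for building a child node looks roughly like this (just a sketch of the idea; child_choices is an illustrative name, not working code I'm confident in):

def child_choices(choices, picked):
    # After picking `picked` from choices[0], the child node's choices are
    # the remaining sublists with `picked` removed from each of them.
    return [[x for x in sub if x != picked] for sub in choices[1:]]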

Can you help me with any other approaches?

CodePudding user response:

Another way to generate all valid outputs with minimal memory usage is to iterate over the elements rather than over the lists. Use a depth-first search so that only valid outputs are generated from the start. This means we need to track three things at each level of the DFS: the current element we may add, the lists that have already been used, and the order in which those lists were used.

To help with the search, we preprocess choices by mapping every element to the set of lists it can come from, which creates a sort of dual version of the problem. This also generates the results in roughly 'most non-None choices first' order.

Since you've specified that the number of unique elements, N, is equal to the number of lists, this approach runs in O(N * |output|) time, and we use generators everywhere to save memory.

import collections
from typing import Dict, Generator, List, Optional, Set

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
# Map every element to the set of list indices that could supply it
element_to_containing_lists: Dict[int, Set[int]] = collections.defaultdict(set)

for i, choice_list in enumerate(choices):
    for x in choice_list:
        element_to_containing_lists[x].add(i)

all_unique_elements = sorted(element_to_containing_lists)


def dfs(used_list_indices: Set[int],
        next_element_index: int,
        used_list_ordered_indices: List[Optional[int]]) -> Generator[List[Optional[int]], None, None]:
    if next_element_index == len(all_unique_elements):
        yield used_list_ordered_indices
    else:
        # If we use the element: try every list that contains it and is still unused
        for possible_list_to_use in element_to_containing_lists[
                                        all_unique_elements[next_element_index]] - used_list_indices:
            yield from dfs(used_list_indices | {possible_list_to_use},
                           next_element_index + 1,
                           used_list_ordered_indices + [possible_list_to_use])

        # If we don't use the element: add None as a sentinel value
        yield from dfs(used_list_indices,
                       next_element_index + 1,
                       used_list_ordered_indices + [None])


# Translate each DFS result (element -> chosen list index) back into
# "chosen element per list"; 'N' marks a list from which nothing was chosen.
for element_to_used_list in dfs(set(), 0, []):
    list_to_chosen_element = ['N'] * len(choices)
    for x, y in zip(all_unique_elements, element_to_used_list):
        if y is not None:
            list_to_chosen_element[y] = x
    print(*list_to_chosen_element, sep='  ')

The first few lines of the output:

1  2  4  5  3
1  2  4  6  3
1  2  4  N  3
1  2  N  5  3
1  2  N  6  3
1  2  N  N  3
1  2  4  5  N
1  2  4  6  5
1  2  4  N  5

This can possibly be optimized to run in O(|output|) time by using a bitmask for 'used lists' rather than a set of their indices.
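A sketch of what that might look like, reusing the names from the code above (element_to_containing_lists is unchanged; only the set of used indices becomes an integer mask):

def dfs_bitmask(used_mask: int,
                next_element_index: int,
                used_list_ordered_indices: List[Optional[int]]) -> Generator[List[Optional[int]], None, None]:
    # Same search as dfs() above, but 'used lists' is an int bitmask:
    # bit i is set when choices[i] has already supplied an element.
    if next_element_index == len(all_unique_elements):
        yield used_list_ordered_indices
        return
    for possible_list_to_use in element_to_containing_lists[all_unique_elements[next_element_index]]:
        if not (used_mask >> possible_list_to_use) & 1:  # list still unused
            yield from dfs_bitmask(used_mask | (1 << possible_list_to_use),
                                   next_element_index + 1,
                                   used_list_ordered_indices + [possible_list_to_use])
    yield from dfs_bitmask(used_mask,
                           next_element_index + 1,
                           used_list_ordered_indices + [None])

The entry point then becomes dfs_bitmask(0, 0, []) instead of dfs(set(), 0, []); this avoids copying a set at every level, though the loop over candidate lists per element remains.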

CodePudding user response:

Don't turn the result into a list. product returns a lazy iterator, so use that to your advantage; filter is lazy too. This way you only ever hold one combination in memory at a time. Some outputs of product will be discarded internally by the filter without you ever seeing them, but they never take up any extra space.

from itertools import product


def criterion(x):
    # Keep a combination only if its non-None entries are all distinct
    k = [i for i in x if i is not None]
    return len(k) == len(set(k))


choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
for c in choices:
    c.append(None)  # None represents "no choice" for that sublist

valid = filter(criterion, product(*choices))  # lazy: nothing is materialized yet
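If you also want the extra criterion from the question (no more than some threshold of None choices), you can chain another lazy filter onto the valid iterator above; max_nones is just an assumed parameter here:

max_nones = 2  # assumed threshold; use whatever limit you need

limited = (c for c in valid if sum(i is None for i in c) <= max_nones)

for combo in limited:
    print(combo)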

CodePudding user response:

Very similar to the previous answer. Not as fast, but I think it's easier to understand. A simple recursive implementation.

def pick_all(choices):
    # The result for an empty choices list is just a single empty list
    if not choices:
        yield []
        return
    # Pluck off the first choice
    first_choice, *remainder = choices

    # Recursively find all solutions to the smaller problem.
    for result in pick_all(remainder):
        # We can always add None to the front
        yield [None, *result]
        for item in first_choice:
            # And we can add any item in this choice that's not already there.
            if item not in result:
                yield [item, *result]
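A possible way to use it, including the 'fewest None choices first' ordering from the question's edit (the sort materializes all results, so it is only practical when the filtered set fits in memory):

choices = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]

# Stream the combinations lazily...
for combo in pick_all(choices):
    print(combo)

# ...or sort by the number of None choices, fewest first
by_fewest_nones = sorted(pick_all(choices), key=lambda c: sum(i is None for i in c))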

CodePudding user response:

[set(dataset) for dataset in itertools.product(*datasets)]

Creates a list of sets, one per combination. Sets are like lists but automatically remove duplicate values. This code does not make sure the sets have length 5, and note that sets are also unordered.


print([dataset for dataset in itertools.product(*datasets) if len(set(dataset)) == len(dataset)])

Lists all combinations and filters out any that contain duplicate values, by checking that a set built from the data has the same length as the original tuple.


import itertools

datasets = [[1], [2, 4], [4], [5, 6, 2], [5, 3]]
datasets = [[*dataset, None] for dataset in datasets]  # None = "no choice" for that sublist

new_dataset = []
for dataset in itertools.product(*datasets):
    without_nones = [n for n in dataset if n is not None]
    if len(without_nones) == len(set(without_nones)):
        new_dataset.append(dataset)

print(new_dataset)

Creates a list of combinations where each combination always has length 5, contains no repeated numbers, and can contain any number of Nones.

This should probably be a generator, and I suspect there is an easier way to go about whatever you need a list in this format for.
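A generator version might look something like this (a sketch; same filtering as above, just yielded lazily instead of collected into a list, with valid_combinations as an illustrative name):

import itertools


def valid_combinations(datasets):
    # Yield tuples of length len(datasets) with no repeated non-None values
    padded = [[*dataset, None] for dataset in datasets]
    for dataset in itertools.product(*padded):
        without_nones = [n for n in dataset if n is not None]
        if len(without_nones) == len(set(without_nones)):
            yield dataset


for combo in valid_combinations([[1], [2, 4], [4], [5, 6, 2], [5, 3]]):
    print(combo)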
