Grouping a grouped list of str without duplicates

I have a nested list of strings that looks roughly like this. Each inner list always contains exactly 5 elements:

text_list = [['aaa','bbb','ccc','ddd','eee'],
['fff','ggg','hhh','iii','jjj'],
['xxx','mmm','ccc','bbb','aaa'],
['fff','xxx','aaa','bbb','ddd'],
['aaa','bbb','ccc','ddd','eee'],
['fff','xxx','aaa','ddd','eee'],
['iii','xxx','ggg','jjj','aaa']]

The objective is simple: for each list, take its first 3 elements and find every other list that contains all 3 of them (anywhere among its 5 elements). Lists that match are grouped together.

So from the above example the output might look like this (output is the index of the list):

[[0,2,4],[3,5]]

Notice that a group containing the same indices as an earlier group, only in a different order, is removed.

I've written the following code to extract the groups, but it returns duplicate groups and I am unsure how to proceed. I also suspect this is not the most efficient way to do the extraction, as the real list can contain upwards of millions of inner lists:

grouped_list = []
for i in range(0,len(text_list)):
    int_temp = []
    for m in range(0,len(text_list)):
        if i == m:
            continue
        bool_check = all( x in text_list[m] for x in text_list[i][0:3])
        
        if bool_check:
            if len(int_temp) == 0:
                int_temp.append(i)
                int_temp.append(m)
                continue
            int_temp.append(m)
           
    
    grouped_list.append(int_temp)
    
## remove index with no groups
grouped_list = [x for x in grouped_list if x != []]

Is there a better way to go about this? How do I remove the duplicate group afterwards? Thank you.

Edit:

To be clearer, I would like to retrieve the lists that are similar to each other, but only using the first 3 elements of each list as the criterion. For example, take the first 3 elements of list A and check whether lists B, C, D, ... each contain all 3 of them. Repeat this for every list, then remove any group that duplicates another group.

CodePudding user response:

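You can convert each sub-list to a set, then for every sub-list collect the indices of all sub-lists whose items include its first 3 items. Storing each group of indices as a frozenset inside a set removes duplicate groups automatically: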

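groups = set()
sets = list(map(set, text_list))
for i, lst in enumerate(text_list):
    first3 = set(lst[:3])
    # Indices of every sub-list that contains all of the first 3 items of
    # sub-list i (this always includes i itself). A frozenset is hashable and
    # ignores order, so identical groups collapse into one entry.
    groups.add(frozenset(j for j, s in enumerate(sets) if first3 <= s))
# Keep only groups with more than one member
print([sorted(group) for group in groups if len(group) > 1])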
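
The same frozenset idea can also answer the narrower question of how to remove duplicate groups after the fact. The following is a minimal sketch, not part of the original answer; the helper name `dedupe_groups` is hypothetical, and the input is what the question's nested loop produces for the sample data:

```python
def dedupe_groups(grouped_list):
    """Drop groups whose indices repeat an earlier group in a different order."""
    seen = set()
    unique = []
    for group in grouped_list:
        key = frozenset(group)  # order-insensitive, hashable fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(sorted(group))
    return unique


# Result of the question's loop for the sample text_list, after empty
# groups have been filtered out
grouped_list = [[0, 2, 4], [3, 5], [4, 0, 2], [5, 3]]
result = dedupe_groups(grouped_list)
print(result)  # [[0, 2, 4], [3, 5]]
```

This only fixes the duplicates; the nested loop that builds `grouped_list` is still quadratic in the number of sub-lists.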

If the input list is long, it is faster to first build a set of frozensets from the first 3 items of every sub-list, then test each 3-item combination of every sub-list against that set. Each 5-item sub-list yields only 10 such combinations, so despite the overhead of generating them the running time grows roughly linearly with the number of sub-lists rather than quadratically:

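from itertools import combinations

# Every distinct "first 3 items" key, stored order-insensitively
sets = {frozenset(lst[:3]) for lst in text_list}
groups = {}
for i, lst in enumerate(text_list):
    # A sub-list contains a key exactly when one of its 3-item combinations equals it
    for c in map(frozenset, combinations(lst, 3)):
        if c in sets:
            groups.setdefault(c, []).append(i)
# Keep only keys matched by more than one sub-list
print([sorted(group) for group in groups.values() if len(group) > 1])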
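
One caveat with the combinations approach: groups are stored per key, so two different keys can end up with the same index group, for example when two sub-lists are permutations of each other with different leading items. The sketch below wraps the same technique in a function and deduplicates by index set. The names `group_by_first_three` and `k` are hypothetical, and it assumes the items within each sub-list are distinct, as in the sample data:

```python
from itertools import combinations


def group_by_first_three(text_list, k=3):
    """Group sub-list indices by shared first-k items, dropping repeated groups.

    Assumes the items inside each sub-list are distinct.
    """
    keys = {frozenset(lst[:k]) for lst in text_list}
    members = {}
    for i, lst in enumerate(text_list):
        for c in map(frozenset, combinations(lst, k)):
            if c in keys:
                members.setdefault(c, []).append(i)

    seen = set()
    result = []
    for idxs in members.values():
        key = frozenset(idxs)
        # Skip singletons and index groups already emitted under another key
        if len(idxs) > 1 and key not in seen:
            seen.add(key)
            result.append(idxs)
    return result


text_list = [['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'ggg', 'hhh', 'iii', 'jjj'],
             ['xxx', 'mmm', 'ccc', 'bbb', 'aaa'],
             ['fff', 'xxx', 'aaa', 'bbb', 'ddd'],
             ['aaa', 'bbb', 'ccc', 'ddd', 'eee'],
             ['fff', 'xxx', 'aaa', 'ddd', 'eee'],
             ['iii', 'xxx', 'ggg', 'jjj', 'aaa']]
sample_groups = group_by_first_three(text_list)
print(sample_groups)  # [[0, 2, 4], [3, 5]]

# Two permutations with different leading items: both keys match both lists,
# but the index group is reported only once
permuted = [['a', 'b', 'c', 'd', 'e'], ['c', 'd', 'e', 'a', 'b']]
permuted_groups = group_by_first_three(permuted)
print(permuted_groups)  # [[0, 1]]
```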