Remove from a list (or DataFrame) substrings contained in the same list

Time:06-21

I have a list of lists (let's call it IDlist). What I want to do is remove the elements (lists) of IDlist that are "substrings" of other elements (other lists) of IDlist.

Lists are not required; Pandas objects are fine too if that makes it easier.

The only approaches I've come up with work only partially (in specific scenarios), so they are useless. I really don't know how to make the list work on "itself".

Here is part of the dataset. For example, take lines 61, 62, 63 and 64: lines 61, 62 and 64 are "substrings" of line 63, so I should keep only line 63.

56 ['2588446634610274688', '2588446634612110336']
57 ['348020242217448576', '348020448377061376', '348020482735930112']
58 ['565983471644073472', '565989347158652288']
59 ['4912580642524184960', '4912898156569562624']
60 ['318121222523445376', '318121256883850112']
61 ['356731363606425856', '357478894075788928', '357479272034582528']
62 ['356731363606425856', '357478894075788928', '357479272034582528']
63 ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528']
64 ['356731363606425856', '356731363608936576', '357478894075788928']
65 ['2512629230496996992', '2512629230497166848']

Print command output:

>>> print(templist)
[['318121222523445376', '318121256883850112'], ['356731363606425856', '357478894075788928', '357479272034582528'], ['356731363606425856', '357478894075788928', '357479272034582528'], ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528'], ['356731363606425856', '356731363608936576', '357478894075788928'], ['2512629230496996992', '2512629230497166848']]

CodePudding user response:

The only solution I found is to iterate through IDlist with nested loops and pop each subset list from a copy of IDlist:

IDlist = [['2588446634610274688', '2588446634612110336'],
          ['348020242217448576', '348020448377061376', '348020482735930112'],
          ['565983471644073472', '565989347158652288'],
          ['4912580642524184960', '4912898156569562624'],
          ['318121222523445376', '318121256883850112'],
          ['318121222523445376', '318121256883850112'],
          ['356731363606425856', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '356731363608936576', '357478894075788928'],
          ['2512629230496996992', '2512629230497166848'], ]


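def is_subset(a, b):
    # True if every element of a also appears in b.
    for i in a:
        if i not in b:
            return False
    return True


# Pop from a shallow copy so that IDlist itself is not changed while it is iterated.
new_IDlist = IDlist.copy()


for id_j, j in enumerate(IDlist):
    for id_k, k in enumerate(IDlist):
        if id_k == id_j:
            continue
        if len(k) < len(j):
            # k is shorter than j: if j contains it, remove one copy of k.
            if is_subset(k, j):
                for idx, l in enumerate(new_IDlist):
                    if k == l:
                        new_IDlist.pop(idx)
                        break
        else:
            # k is at least as long as j: if k contains j (for example, an exact
            # duplicate), keep the first copy of k and drop every later copy.
            if is_subset(j, k) and k in new_IDlist:
                first = new_IDlist.index(k)
                new_IDlist[first + 1:] = [l for l in new_IDlist[first + 1:] if l != k]


for item in new_IDlist:
    print(item)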

Output

['2588446634610274688', '2588446634612110336']
['348020242217448576', '348020448377061376', '348020482735930112']
['565983471644073472', '565989347158652288']
['4912580642524184960', '4912898156569562624']
['318121222523445376', '318121256883850112']
['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528']
['2512629230496996992', '2512629230497166848']
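
The same idea can also be written with sets. The following is only a minimal sketch, not part of the answer above. It assumes that "substring" means "all items also appear in another list" (as in the question's example, where line 61 counts as contained in line 63 although its items are not contiguous there) and that item order inside a list does not matter. The name drop_contained is made up for illustration, and the sample rows are lines 60 to 65 from the question.

```python
def drop_contained(lists):
    """Return the lists not contained in any other list (hypothetical helper)."""
    # Map each distinct item set to the first list that produced it.
    # A dict keeps insertion order, so the original order survives.
    first_seen = {}
    for items in lists:
        first_seen.setdefault(frozenset(items), items)

    keys = list(first_seen)
    # For frozensets, a < b is the proper-subset test.
    return [first_seen[a] for a in keys if not any(a < b for b in keys)]


# Lines 60 to 65 of the dataset shown in the question.
templist = [
    ['318121222523445376', '318121256883850112'],
    ['356731363606425856', '357478894075788928', '357479272034582528'],
    ['356731363606425856', '357478894075788928', '357479272034582528'],
    ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528'],
    ['356731363606425856', '356731363608936576', '357478894075788928'],
    ['2512629230496996992', '2512629230497166848'],
]

result = drop_contained(templist)
for row in result:
    print(row)
```

For these six rows only lines 60, 63 and 65 remain. Like the nested loops above, this compares every pair of lists, so the cost still grows quadratically with the number of lists.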

CodePudding user response:

You can check whether each list is a subset of any other list with a nested loop, after converting the lists to sorted tuples to drop exact duplicates:

def exclude_subsets(data):
    # Sorted tuples are hashable, so building a set drops exact duplicates.
    cleaned_data = set(map(lambda x: tuple(sorted(x)), data))

    # For every ordered pair (k, l), record whether item k is a superset of item l.
    superset_list = []
    for k, i in enumerate(cleaned_data):
        for l, j in enumerate(cleaned_data):
            if k != l:
                superset_list.append([k, l, set(i).issuperset(j)])
    # Positions of the items that are contained in some other item.
    subset_ids = list(map(lambda x: x[1], filter(lambda x: x[2], superset_list)))
    return [list(i) for k, i in enumerate(cleaned_data) if k not in subset_ids]


data = [['2588446634610274688', '2588446634612110336'],
          ['348020242217448576', '348020448377061376', '348020482735930112'],
          ['565983471644073472', '565989347158652288'],
          ['4912580642524184960', '4912898156569562624'],
          ['318121222523445376', '318121256883850112'],
          ['318121222523445376', '318121256883850112'],
          ['356731363606425856', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528'],
          ['356731363606425856', '356731363608936576', '357478894075788928'],
          ['2512629230496996992', '2512629230497166848']]


print(exclude_subsets(data))

Output (the order may vary between runs, since a set has no defined order):

[['565983471644073472', '565989347158652288'],
 ['4912580642524184960', '4912898156569562624'],
 ['356731363606425856','356731363608936576','357478894075788928','357479272034582528'],
 ['2588446634610274688', '2588446634612110336'],
 ['348020242217448576', '348020448377061376', '348020482735930112'],
 ['2512629230496996992', '2512629230497166848'],
 ['318121222523445376', '318121256883850112']]
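
Since the question also allows Pandas objects, here is a rough sketch of the same filter on a Series of lists. It is an illustration under assumptions, not a tested recipe for the full dataset: the name drop_contained_rows is made up, the index 60 to 65 simply mirrors the line numbers shown in the question, and item order inside each list is ignored.

```python
import pandas as pd


def drop_contained_rows(s):
    """Drop rows whose list is contained in another row's list (hypothetical helper)."""
    # Sorted tuples are hashable, so duplicated() can flag repeated rows.
    keys = s.map(lambda items: tuple(sorted(items)))
    unique = s[~keys.duplicated()]

    sets = [frozenset(items) for items in unique]
    # Keep a row only if its item set is not a proper subset of another row's set.
    keep = [not any(a < b for b in sets) for a in sets]
    return unique.loc[keep]


# Lines 60 to 65 of the dataset shown in the question, indexed by line number.
s = pd.Series(
    [
        ['318121222523445376', '318121256883850112'],
        ['356731363606425856', '357478894075788928', '357479272034582528'],
        ['356731363606425856', '357478894075788928', '357479272034582528'],
        ['356731363606425856', '356731363608936576', '357478894075788928', '357479272034582528'],
        ['356731363606425856', '356731363608936576', '357478894075788928'],
        ['2512629230496996992', '2512629230497166848'],
    ],
    index=range(60, 66),
)

result = drop_contained_rows(s)
print(result)
```

The surviving index labels (60, 63 and 65) show which of the original lines were kept, which a plain list of lists does not tell you directly.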