Python: remove duplicates from a list of lists with uneven distribution

Time:11-12

I have a Python list of lists. I want to merge all the inner lists that share at least one common element, and remove the duplicate items.

The data set is large, and some of the inner lists have elements in common. I want to merge every group of lists that share data.

# sample data
foo = [
[0,1,2,6,9],
[0,1,2,6,5],
[3,4,7,3,2],
[12,36,28,73],
[537],
[78,90,34,72,0],
[573,73],
[99],
[41,44,79],
]

# I want to get this
[
[0,1,2,6,9,5,3,4,7,3,2,78,90,34,72,0],
[12,36,28,73,573,73,573],
[99],
[41,44,79],
]

Lists that share even one common element should be grouped together.

The original data file is the JSON file linked in the code comment below.

Edit

This is what I am trying:

import json

# Input data: https://files.catbox.moe/y1yt5w.json
with open('x.json') as f:
    data = json.load(f)


class Relations:
    def process_relation(self, flat_data):
        # For each index i, collect every list that contains the value i.
        rel = {}
        for i in range(len(flat_data)):
            rel[i] = []
            for n in flat_data:
                if i in n:
                    rel[i].extend(n)
        # Deduplicate the values collected for each key.
        return {k: list(set(v)) for k, v in rel.items()}

    def process(self, flat_data):
        return self.process_relation(flat_data)


rel = Relations()
# Output: https://files.catbox.moe/n65tie.json
with open('out.json', 'w') as f:
    print(json.dumps(rel.process(data), indent=4), file=f)

NOTE: the largest number present in the data will be equal to the length of the list of lists.

CodePudding user response:

A simple (and probably non-optimal) algorithm that modifies the input data in place:

target_idx = 0

while target_idx < len(data):
    src_idx = target_idx + 1
    did_merge = False
    while src_idx < len(data):
        if set(data[target_idx]) & set(data[src_idx]):
            data[target_idx].extend(data[src_idx])
            data.pop(src_idx)  # this was merged
            did_merge = True
            continue  # with same src_idx
        src_idx += 1
    if not did_merge:
        target_idx += 1
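
The loop above merges the lists but keeps repeated values inside each merged list. If the duplicates should also be dropped, one possible alternative is a union-find (disjoint-set) pass over the elements. This is only a sketch of a different technique, not the method from the answer above. The function name `merge_overlapping` is made up for illustration, and it assumes the elements are hashable and that first-seen order within each group is acceptable.

```python
def merge_overlapping(lists):
    """Group lists sharing at least one element; drop duplicate items."""
    parent = {}  # element -> parent element in the disjoint-set forest

    def find(x):
        # Register unseen elements as their own root, then walk to the
        # root while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Pass 1: every element of a list joins the set of its first element.
    for lst in lists:
        for item in lst[1:]:
            union(lst[0], item)

    # Pass 2: bucket elements by root. A dict is used as an ordered set,
    # so duplicates vanish and first-seen order is kept.
    groups = {}
    for lst in lists:
        for item in lst:
            groups.setdefault(find(item), {})[item] = None

    return [list(group) for group in groups.values()]


# Sample data from the question.
foo = [
    [0, 1, 2, 6, 9],
    [0, 1, 2, 6, 5],
    [3, 4, 7, 3, 2],
    [12, 36, 28, 73],
    [537],
    [78, 90, 34, 72, 0],
    [573, 73],
    [99],
    [41, 44, 79],
]

print(merge_overlapping(foo))
```

On the sample data this prints `[[0, 1, 2, 6, 9, 5, 3, 4, 7, 78, 90, 34, 72], [12, 36, 28, 73, 573], [537], [99], [41, 44, 79]]`. Note that `[537]` shares no element with any other list, so this sketch keeps it as its own group. Unlike the in-place loop, it returns a new list and leaves the input untouched.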