Combining lists by identifying overlaps between their values and ensuring uniqueness between lists


I have the following data structure:

{company_id: 
   [
      (set_id,[product_id,product_id,...]),
      (set_id,[product_id,product_id,...]),
      (set_id,[product_id,product_id,...])
      ,...
   ],
company_id:
   [
      (set_id,[product_id,product_id,...]),
      (set_id,[product_id,product_id,...]),
      (set_id,[product_id,product_id,...])
      ,...
   ],
}

A sample set of data may be:

{83: 
   [
      (128, []), 
      (129, [19283, 23837]), 
      (130, [29553]), 
      (133, [19283, 20070, 20072, 20087, 20095]), 
      (134, [20069, 20070, 20071, 20095, 20098])
   ],
84:
   [
      (145, [2322,2211]), 
      (146, [2333, 2211]), 
      (152, [2333])
   ],
}

What I need to achieve is:

{83: 
   [
      (128, []), 
      (130, [29553]), 
      (133, [19283, 20069, 20070, 20071, 20072, 20087, 20095, 20098, 23837])
   ],
84:
   [
      (145, [2322,2211, 2333])
   ],
}

The result is, for each company_id, a list of tuples in which no product_id appears in another tuple's list of the same company_id.

  • It is guaranteed that each product_id exists inside only one company_id's lists.
  • It doesn't matter which set_id the product_ids are merged into.
  • If no value in a tuple's list exists in any other tuple's list, keep it as it is and don't merge it with any other list (see the small overlap check sketched after this list).
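
To make "overlap" concrete, here is a minimal sketch; the helper name overlaps is my own, not from the question. Two lists should be merged when they share at least one product_id, and merging is transitive, which is why sets 129, 133 and 134 all collapse into a single tuple above (129 and 134 share nothing directly, but both overlap 133):

# Hypothetical helper: two product lists "overlap" when they share
# at least one product_id.
def overlaps(a, b):
    return bool(set(a) & set(b))

print(overlaps([19283, 23837], [19283, 20070]))  # True  -> merge
print(overlaps([19283, 23837], [29553]))         # False -> keep separate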

I already started with some nested for loops, but it feels too complex. Here is my (non-working) code, which doesn't yet delete the lists that have already been processed:

import copy

# Sample input data (same structure as described above)
data = {
    83: [(128, []), (129, [19283, 23837]), (130, [29553]),
         (133, [19283, 20070, 20072, 20087, 20095]),
         (134, [20069, 20070, 20071, 20095, 20098])],
    84: [(145, [2322, 2211]), (146, [2333, 2211]), (152, [2333])],
}

final_result = copy.deepcopy(data)
for company, val in data.items():
    for tup in val:  # 'tup' instead of 'set', which would shadow the builtin
        for x in final_result[company]:
            # Merge overlapping lists (duplicates are not removed yet)
            if any(item in x[1] for item in tup[1]):
                x[1].extend(tup[1])

print(final_result)

I would be happy if someone could provide me with a solution to this problem. I'm also happy to use numpy or pandas for it!

CodePudding user response:

This meets your requirements as stated (it is not the only possible solution). It caches the product_ids seen so far within each company_id and filters out any already observed when moving to the next (set_id, product_ids) item.

data = {83: 
   [
      (128, []), 
      (129, [19283, 23837]), 
      (130, [29553]), 
      (133, [19283, 20070, 20072, 20087, 20095]), 
      (134, [20069, 20070, 20071, 20095, 20098])
   ],
84:
   [
      (145, [2322, 2211]), 
      (146, [2333, 2211]), 
      (152, [2333])
   ],
}

def clean_data_wrapper(data):
    # Rewrites each company's list in place so that every product_id
    # appears in at most one (set_id, product_ids) tuple.
    for company_id, product_tuples in data.items():
        memo = set()  # product_ids already assigned to an earlier tuple
        revised_product_tuples = []
        for set_num, product_list in product_tuples:
            # Keep only the ids not seen in a previous tuple of this company
            filtered_ids = set(product_list).difference(memo)
            revised_product_tuples.append((set_num, list(filtered_ids)))
            memo.update(filtered_ids)
        data[company_id] = revised_product_tuples
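
Called on the sample data above, the function mutates data in place. A minimal usage sketch (note that going through set() means the order of ids inside each rewritten list is not guaranteed):

clean_data_wrapper(data)
print(data)
# Company 84, for example, becomes
# [(145, [2322, 2211]), (146, [2333]), (152, [])]
# (the order inside each list may vary, since sets are unordered)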

CodePudding user response:

While @anon01 worked on his solution (which seems much better), I also worked out one of my own:

import copy

# Sample input data from the question
data = {
    83: [(128, []), (129, [19283, 23837]), (130, [29553]),
         (133, [19283, 20070, 20072, 20087, 20095]),
         (134, [20069, 20070, 20071, 20095, 20098])],
    84: [(145, [2322, 2211]), (146, [2333, 2211]), (152, [2333])],
}

final_result = copy.deepcopy(data)
final_result2 = {}
for company, val in data.items():
    # Extend every overlapping list in the copy. The loop variable must not
    # be named 'set', or the call to the set() builtin below would raise a
    # TypeError.
    for tup in val:
        for x in final_result[company]:
            if any(item in x[1] for item in tup[1]):
                x[1].extend(tup[1])

    # Deduplicate the merged lists by converting them to sets
    for x in final_result[company]:
        final_result2.setdefault(company, []).append(set(x[1]))

print(final_result2)
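
For comparison, here is a minimal sketch of a grouping approach that also merges transitively overlapping lists, which is what the desired output shows (sets 129, 133 and 134 collapsing into one tuple). This is my own illustration, not from the thread: the helper name merge_company is hypothetical, and it keeps an arbitrary set_id for each merged group, which the question explicitly allows.

def merge_company(tuples):
    # Fold each (set_id, product_ids) tuple into the existing groups:
    # any group sharing a product_id with the new list is merged into it.
    groups = []  # list of (set_id, set_of_product_ids)
    for set_id, product_list in tuples:
        ids = set(product_list)
        merged = [g for g in groups if g[1] & ids]      # overlapping groups
        kept = [g for g in groups if not (g[1] & ids)]  # untouched groups
        for _, group_ids in merged:
            ids |= group_ids  # transitive merge
        # keep the first set_id seen for the merged group (choice is arbitrary)
        kept.append((merged[0][0] if merged else set_id, ids))
        groups = kept
    return [(sid, sorted(pids)) for sid, pids in groups]

print({company: merge_company(tuples) for company, tuples in data.items()})

On the sample data this yields the desired grouping; only the surviving set_ids (and the order inside each list, which this sketch sorts) may differ from the example output.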