I have the following data structure:
{company_id:
    [
        (set_id, [product_id, product_id, ...]),
        (set_id, [product_id, product_id, ...]),
        (set_id, [product_id, product_id, ...]),
        ...
    ],
 company_id:
    [
        (set_id, [product_id, product_id, ...]),
        (set_id, [product_id, product_id, ...]),
        (set_id, [product_id, product_id, ...]),
        ...
    ],
}
A sample set of data may be:
{83:
    [
        (128, []),
        (129, [19283, 23837]),
        (130, [29553]),
        (133, [19283, 20070, 20072, 20087, 20095]),
        (134, [20069, 20070, 20071, 20095, 20098])
    ],
 84:
    [
        (145, [2322, 2211]),
        (146, [2333, 2211]),
        (152, [2333])
    ],
}
What I need to achieve is:
{83:
    [
        (128, []),
        (130, [29553]),
        (133, [19283, 20069, 20070, 20071, 20072, 20087, 20095, 20098, 23837])
    ],
 84:
    [
        (145, [2322, 2211, 2333])
    ],
}
The result is a list of tuples for each company_id in which no product_id appears in more than one tuple's list for that company_id.
- It is guaranteed that each product_id exists inside only one company_id's list.
- It doesn't matter which set_id the product_ids are merged into.
- If no value in a tuple's list exists in any other tuple's list, keep it as it is and don't merge it with any other list.
I have already started with some nested for loops, but it feels too complex. Here is my (non-working) code, which does not yet include the deletion of lists that have already been visited:
import copy
data = {83: [(128, []), (130, [29553]), (133, [19283, 20069, 20070, 20071, 20072, 20087, 20095, 20098, 23837])], 84:[(145, [2322,2211, 2333])],}
final_result = copy.deepcopy(data)
for company, val in data.items():
    for set in val:
        for x in final_result[company]:
            if any(item in x[1] for item in set[1]):
                x[1].extend(set[1])
print(final_result)
I would be happy if someone could provide me with a solution to my problem. I'm also happy to use numpy or pandas for it!
CodePudding user response:
This meets your requirements as stated (not a unique solution). It caches product_ids for each set_id, and filters out those already observed when moving to the next (set_id, product_ids) list items.
data = {83:
    [
        (128, []),
        (129, [19283, 23837]),
        (130, [29553]),
        (133, [19283, 20070, 20072, 20087, 20095]),
        (134, [20069, 20070, 20071, 20095, 20098])
    ],
 84:
    [
        (145, [2322, 2211]),
        (146, [2333, 2211]),
        (152, [2333])
    ],
}
def clean_data_wrapper(data):
    # Modifies `data` in place; nothing is returned.
    for company_id, product_tuples in data.items():
        memo = set()  # product_ids already seen for this company
        revised_product_tuples = []
        for set_num, product_list in product_tuples:
            # Keep only the product_ids not seen in an earlier set.
            filtered_ids = set(product_list).difference(memo)
            revised_product_tuples.append((set_num, list(filtered_ids)))
            memo.update(filtered_ids)
        data[company_id] = revised_product_tuples
CodePudding user response:
While @anon01 worked on his solution (which seems much better), I also worked out one of my own:
import copy
data = {83: [(128, []), (130, [29553]), (133, [19283, 20069, 20070, 20071, 20072, 20087, 20095, 20098, 23837])], 84:[(145, [2322,2211, 2333])],}
final_result = copy.deepcopy(data)
final_result2 = {}
for company, val in data.items():
    # The loop variable is not named `set`, so the built-in set() stays usable below.
    for product_set in val:
        for x in final_result[company]:
            if any(item in x[1] for item in product_set[1]):
                x[1].extend(product_set[1])
    # Collapse each extended list into a set to drop duplicate product_ids.
    for x in final_result[company]:
        if company not in final_result2:
            final_result2[company] = [set(x[1])]
        else:
            final_result2[company].append(set(x[1]))
print(final_result2)
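For comparison, here is a minimal sketch of one way to do the merge described in the question, where sets that share a product_id (directly or through a chain of other sets) end up in one tuple. The function name merge_overlapping_sets and the choice to keep the smallest set_id of each merged group are my own assumptions; the question only says the set_id doesn't matter. Product lists are sorted in the output purely to make the result deterministic.

```python
def merge_overlapping_sets(data):
    """Per company, merge all sets that share at least one product_id.

    Hypothetical helper, not taken from either answer above.
    Returns a new dict; the input is left untouched.
    """
    result = {}
    for company_id, product_tuples in data.items():
        # Invariant: the groups collected so far are pairwise disjoint.
        groups = []
        for set_id, product_ids in product_tuples:
            products = set(product_ids)
            keep_id = set_id
            disjoint = []
            for other_id, other in groups:
                if products & other:
                    # Overlap found: absorb the existing group.
                    products |= other
                    keep_id = min(keep_id, other_id)  # assumption: keep smallest set_id
                else:
                    disjoint.append((other_id, other))
            groups = disjoint + [(keep_id, products)]
        result[company_id] = [
            (sid, sorted(prods))
            for sid, prods in sorted(groups, key=lambda g: g[0])
        ]
    return result


sample = {
    83: [
        (128, []),
        (129, [19283, 23837]),
        (130, [29553]),
        (133, [19283, 20070, 20072, 20087, 20095]),
        (134, [20069, 20070, 20071, 20095, 20098]),
    ],
    84: [
        (145, [2322, 2211]),
        (146, [2333, 2211]),
        (152, [2333]),
    ],
}

merged = merge_overlapping_sets(sample)
print(merged)
```

On the sample data this groups 129, 133 and 134 under set_id 129 and 145, 146 and 152 under set_id 145, while the empty set 128 and the isolated set 130 are left alone. Because the existing groups are always disjoint, a single pass over them is enough for each new set.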