How to get the intersection of multiple nested lists?-CodePudding

I am playing with python and wonder how to get the intersection of multiple nested lists.

list1 = [[10, 1], [200, 2], [300, 9], [400, 1], [500, 1]]
list2 = [[22, 1], [200, 2], [300, 9], [900, 1], [660, 1], [500, 1]]
list3 = [[30, 1], [200, 2], [300, 0], [400, 1], [500, 1]]

Each list contains multiple small lists, each small list contains two integers. The first integer of each small list is different in each list.

Now, if the first integer appears in each list , then keep it, sum the second integer.

Then the result should be [[200,6],[300,18],[500,3]]

A friend solved this problem:

import pandas as pd
list1 = [[100, 1], [200, 2], [300, 99], [400, 1], [500, 1]]
list2 = [[22, 1], [200, 2], [300, 9], [900, 1],[1000, 1], [500, 1]]
list3 = [[30, 1], [200, 2], [300, 0], [400, 1], [500, 1]]

df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)
df3 = pd.DataFrame(list3)

df4 = df1.merge(df2, on=0, how='inner')

df5 = df4.merge(df3, on=0, how='inner')

df6 = df5.copy()
df6['sum'] = df6.iloc[:, 1:].sum(axis=1)
list4 = df6[[0, 'sum']].values.tolist()

print(list4)

When the data is huge, the running speed is slow. I wonder if there is any method that can use "set" to speed up the speed?

CodePudding user response：

Each list contains multiple small lists, each small list contains two integers. The first integer of each small list is different in each list.

I'd suggest converting your nested lists to dictionaries, which seem to make way more sense here -- not just for this operation, but for working with the data in general. Then, all you need is a simple dictionary comprehension.

>>> d1, d2, d3 = dict(list1), dict(list2), dict(list3)
>>> d1
{10: 1, 200: 2, 300: 9, 400: 1, 500: 1}
>>> {k: d1[k]   d2[k]   d3[k] for k in d1 if k in d2 and k in d3}
{200: 6, 300: 18, 500: 3}

Or the same as a nested list comprehension, if you really need lists:

>>> [[k, d1[k]   d2[k]   d3[k]] for k in d1 if k in d2 and k in d3]
[[200, 6], [300, 18], [500, 3]]

About running time: After creating the dicts in O(n), the if k in d2 and k in d3 checks are O(1) for an overall complexity of O(n), which should be about as fast as it gets.

About suggestion from comments: Using d1.keys() & d2.keys() & d3.keys() works, too, but does not seem to be any faster (a bit slower in fact, for 1000 elements with 25% overlap).

>>> d1, d2, d3 = ({random.randrange(1000): random.randrange(1000) for _ in range(1000)} for _ in range(3))                                                              
>>> len(set(d1) & set(d2) & set(d3))                                                                                              
259
>>> %timeit {k: d1[k]   d2[k]   d3[k] for k in d1 if k in d2 and k in d3}                                                                                               
84.2 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit {k: d1[k]   d2[k]   d3[k] for k in d1.keys() & d2.keys() & d3.keys()}                                                                                       
117 µs ± 817 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

CodePudding user response：

For large data, it may be more efficient to just iterate the values once, summing for each key and counting the occurrences of each key. Then it's a simple list generation to get your desired result:

d = {}
c = {}
for k, v in list1 list2 list3:
    d[k] = d.get(k, 0)   v
    c[k] = c.get(k, 0)   1

[[k, v] for k, v in d.items() if c[k] == 3]

Output:

[[200, 6], [300, 18], [500, 3]]

Note you can optimise the above code slightly by using a defaultdict:

d = defaultdict(int)
c = defaultdict(int)
for k, v in list1 list2 list3:
    d[k] = d[k]   v
    c[k] = c[k]   1