How do I efficiently find which elements of a list are in another list?-CodePudding

I want to know which elements of list_1 are in list_2. I need the output as an ordered list of booleans. But I want to avoid for loops, because both lists have over 2 million elements.

This is what I have and it works, but it's too slow:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []
for i in list_1:
   booleans.append(i in list_2)

# booleans = [False, False, True, True, False, False]

I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.

How can I make my code more efficient?

CodePudding user response：

You can take advantage of the O(1) in operator complexity for the set() function to make your for loop more efficient, so your final algorithm would run in O(n) time, instead of O(n*n):

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)
print(booleans)

If you only want to know the elements, you can use an intersection of sets like that, which will be an efficient solution due to the use of set() function, already optimized by other Python engineers:

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

print(set(list_1).intersection(set(list_2)))

Output:

{1, 2}

Also, to provide a list format output, you can turn your resulting set into a list with list() function:

print(list(set(list_1).intersection(set(list_2))))

CodePudding user response：

I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the numpy isin() approach.

The setup

import random

list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]

Timings - ordered from fastest to slowest.

Cardstdani - approach 1

List comprehension (see this question for why list comprehensions are faster)

s = set(list_2)
booleans = [i in s for i in list_1]

6.47 ms ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

No list comprehension

s = set(list_2)
booleans = []
for i in list_1:
   booleans.append(i in s)

8.04 ms ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Cardstdani - approach 2 (with an assist from Timus)

common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]

8.55 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy approach (crissal)

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)

19.3 ms ± 54.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

list comprehension

l = [i in list_2 for i in list_1]

5.04 s ± 31.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sharim - approach 1

booleans = list(map(lambda e:e in list_2, iter(list_1)))

5.17 s ± 26.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sharim - approach 2

set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))

5.98 s ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sharim - approach 3

set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, iter(list_1)))

6.05 s ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Varying the length of the input

Applying a slightly different setup to the above

import random 

list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]

and varying n in [2 ** k for k in range(20)] (the best of each person's approaches is shown):

CodePudding user response：

If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.

import numpy as np

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)
# array([False, False,  True,  True, False, False])

CodePudding user response：

You can use the map function. If you are not familiar with the lambda function then you can check this out.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = list(map(lambda e:e in list_2,iter(list_1)))
print(booleans)

output

[False, False, True, True, False, False]

However, if you want the only elements which are not the same then instead of a map function you can use the filter function with the same code.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

new_lst = list(filter(lambda e:e in list_2,iter(list_1)))# edited instead of map use filter.
print(new_lst)

output

[1, 2]

Edited

I am removing the in statement from the code because in also acts as a loop. I am checking using the timeit module.

you can use this code for the list containing True and False.

This way is fastest then above one.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(booleans)

output

[False, False, True, True, False, False]

This one is for the list containing the elements.

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)

booleans = list(filter(lambda e:set_2!=set_2-{e},iter(list_1))) # edited instead of map use filter
print(booleans)

output

[1,2]

editing: Is really `NUMPY` way faster than this one.

This is my way.

import time
a = time.time()
list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)

booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))

print(time.time()-a)

Time Taken: 0.5790872573852539

This is the numpy way.

import time
a = time.time()
import numpy as np

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000

a1 = np.array(list_1)
a2 = np.array(list_2)

np.isin(a1, a2)

print(time.time()-a)

Time Taken: 0.6325583362579346

Is this really fast

As Op Never use lambda function then this is for him

list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func():
    return set_2!=set_2-{e}

booleans = list(map(func,iter(list_1)))

I know my way isn't the best way to this answer this because I am never using NumPy much.

CodePudding user response：

It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting the list is n ln n, but once you have them sorted, you can compare them in linear time by checking whether the elements match, and if they don't, advance to the next item in the list whose current element is smaller.

CodePudding user response：

Use set() to get a list of unique items in each list

list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]

booleans = []

set_1 = set(list_1)
set_2 = set(list_2)

if(set_1 & set_2):
  print(set_1 & set_2)
else:
  print("No common elements")

Output:

{1, 2}

Edited

editing: Is really NUMPY way faster than this one.

Is this really fast

As Op Never use lambda function then this is for him

editing: Is really `NUMPY` way faster than this one.