I want to know which elements of list_1
are in list_2
. I need the output as an ordered list of booleans. But I want to avoid for
loops, because both lists have over 2 million elements.
This is what I have and it works, but it's too slow:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
for i in list_1:
booleans.append(i in list_2)
# booleans = [False, False, True, True, False, False]
I could split the list and use multithreading, but I would prefer a simpler solution if possible. I know some functions like sum() use vector operations. I am looking for something similar.
How can I make my code more efficient?
CodePudding user response:
You can take advantage of the O(1)
in operator complexity for the set()
function to make your for loop more efficient, so your final algorithm would run in O(n)
time, instead of O(n*n)
:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
s = set(list_2)
booleans = []
for i in list_1:
booleans.append(i in s)
print(booleans)
If you only want to know the elements, you can use an intersection of sets like that, which will be an efficient solution due to the use of set()
function, already optimized by other Python engineers:
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
print(set(list_1).intersection(set(list_2)))
Output:
{1, 2}
Also, to provide a list format output, you can turn your resulting set into a list with list()
function:
print(list(set(list_1).intersection(set(list_2))))
CodePudding user response:
I thought it would be useful to actually time some of the solutions presented here on a larger sample input. For this input and on my machine, I find Cardstdani's approach to be the fastest, followed by the numpy
isin()
approach.
The setup
import random
list_1 = [random.randint(1, 10_000) for i in range(100_000)]
list_2 = [random.randint(1, 10_000) for i in range(100_000)]
Timings - ordered from fastest to slowest.
Cardstdani - approach 1
List comprehension (see this question for why list comprehensions are faster)
s = set(list_2)
booleans = [i in s for i in list_1]
6.47 ms ± 54.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
No list comprehension
s = set(list_2)
booleans = []
for i in list_1:
booleans.append(i in s)
8.04 ms ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Cardstdani - approach 2 (with an assist from Timus)
common = set(list_1) & set(list_2)
booleans = [item in common for item in list_1]
8.55 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numpy approach (crissal)
a1 = np.array(list_1)
a2 = np.array(list_2)
np.isin(a1, a2)
19.3 ms ± 54.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
list comprehension
l = [i in list_2 for i in list_1]
5.04 s ± 31.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sharim - approach 1
booleans = list(map(lambda e:e in list_2, iter(list_1)))
5.17 s ± 26.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sharim - approach 2
set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, list_1))
5.98 s ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sharim - approach 3
set_2 = set(list_2)
booleans = list(map(lambda e: set_2 != set_2 - {e}, iter(list_1)))
6.05 s ± 44.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Varying the length of the input
Applying a slightly different setup to the above
import random
list_1 = [random.randint(1, n) for i in range(n)]
list_2 = [random.randint(1, n) for i in range(n)]
and varying n
in [2 ** k for k in range(20)]
(the best of each person's approaches is shown):
CodePudding user response:
If you want to use a vector approach you can also use Numpy isin. It's not the fastest method, as demonstrated by oda's excellent post, but it's definitely an alternative to consider.
import numpy as np
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
a1 = np.array(list_1)
a2 = np.array(list_2)
np.isin(a1, a2)
# array([False, False, True, True, False, False])
CodePudding user response:
You can use the map
function. If you are not familiar with the lambda function then you can check this out.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = list(map(lambda e:e in list_2,iter(list_1)))
print(booleans)
output
[False, False, True, True, False, False]
However, if you want the only elements which are not the same then instead of a map
function you can use the filter
function with the same code.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
new_lst = list(filter(lambda e:e in list_2,iter(list_1)))# edited instead of map use filter.
print(new_lst)
output
[1, 2]
Edited
I am removing the in
statement from the code because in
also acts as a loop. I am checking using the timeit
module.
you can use this code for the list containing True
and False
.
This way is fastest then above one.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)
booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(booleans)
output
[False, False, True, True, False, False]
This one is for the list containing the elements.
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
set_2 = set(list_2)
booleans = list(filter(lambda e:set_2!=set_2-{e},iter(list_1))) # edited instead of map use filter
print(booleans)
output
[1,2]
editing: Is really NUMPY
way faster than this one.
This is my way.
import time
a = time.time()
list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
booleans = list(map(lambda e:set_2!=set_2-{e},iter(list_1)))
print(time.time()-a)
Time Taken: 0.5790872573852539
This is the numpy way.
import time
a = time.time()
import numpy as np
list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
a1 = np.array(list_1)
a2 = np.array(list_2)
np.isin(a1, a2)
print(time.time()-a)
Time Taken: 0.6325583362579346
Is this really fast
As Op Never use lambda function then this is for him
list_1 = [0,0,1,2,0,0]*100000
list_2 = [1,2,3,4,5,6]*100000
set_2 = set(list_2)
def func():
return set_2!=set_2-{e}
booleans = list(map(func,iter(list_1)))
I know my way isn't the best way to this answer this because I am never using NumPy
much.
CodePudding user response:
It's probably simpler to just use the built-in set intersection method, but if you have lots of lists that you're comparing, it might be faster to sort the lists. Sorting the list is n ln n, but once you have them sorted, you can compare them in linear time by checking whether the elements match, and if they don't, advance to the next item in the list whose current element is smaller.
CodePudding user response:
Use set()
to get a list of unique items in each list
list_1 = [0,0,1,2,0,0]
list_2 = [1,2,3,4,5,6]
booleans = []
set_1 = set(list_1)
set_2 = set(list_2)
if(set_1 & set_2):
print(set_1 & set_2)
else:
print("No common elements")
Output:
{1, 2}