I have a variable foo, which points to a string, "bar":
foo = "bar"
I have a list called whitelist.
If whitelist is not empty, the elements it contains are a whitelist.
If whitelist is empty, then the if statement should permit any string.
I have implemented this as follows:
whitelist = ["bar", "baz", "x", "y"]
if whitelist and foo in whitelist:
    print("bar is whitelisted")
    # do something with whitelisted element
if whitelist, by my understanding, checks whether whitelist evaluates to True. whitelist evaluates to False if it is empty, and to True if it contains elements.
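A small sketch of the truthiness rule described above. One observation, offered with some caution: with an empty list, `whitelist and foo in whitelist` is falsy, so the branch is skipped rather than permitting any string. If an empty whitelist is meant to allow everything, `not whitelist or foo in whitelist` is one way to express that. The helper name `is_permitted` is hypothetical and not part of the original code.

```python
def is_permitted(foo, whitelist):
    """Return True if whitelist is empty (no restriction) or foo is in it."""
    return not whitelist or foo in whitelist

# An empty list is falsy, a non-empty list is truthy.
assert bool([]) is False
assert bool(["bar"]) is True

# The original condition is falsy for every string when the whitelist is empty...
original_empty = bool([] and "bar" in [])

# ...whereas this form lets any string through in that case.
permitted_empty = is_permitted("bar", [])
permitted_hit = is_permitted("bar", ["bar", "baz"])
permitted_miss = is_permitted("foo", ["bar", "baz"])
```

Because `not whitelist` is evaluated first, the membership test is only run when the whitelist actually has elements.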
However, the real implementation of this contains:
- lots of strings to check, e.g. "bar", "baz", "x", "y", "a", "b"
- lots of whitelists to check against
Therefore, I was wondering if there is a more computationally efficient way of writing the if statement. It seems like checking the existence of whitelist each time is inefficient, and could be simplified.
CodePudding user response:
These are some ways to check whether an element is in a list or not.
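from timeit import timeit
import numpy as np

whitelist1 = {"bar", "baz", "x", "y"}
whitelist2 = np.array(["bar", "baz", "x", "y"])  # already sorted, as np.searchsorted requires

def func1():
    # set intersection with a one-element set
    return {"foo"}.intersection(whitelist1)

def func2():
    # plain membership test on a set
    return "foo" in whitelist1

def func3():
    # caution: np.isin treats a set as a single object, not a collection,
    # so this is not a true membership test; a list or array would be
    return np.isin('foo', whitelist1)

def func4():
    # binary search; only valid on a sorted array
    return whitelist2[np.searchsorted(whitelist2, 'foo')] == 'foo'

print("func1=", timeit(func1, number=100000))
print("func2=", timeit(func2, number=100000))
print("func3", timeit(func3, number=100000))
print("func4=", timeit(func4, number=100000))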
Time taken by each function:
func1= 0.01365450001321733
func2= 0.005112499929964542
func3 0.5342871999600902
func4= 0.17057700001168996
For a randomly generated list:
from timeit import timeit
import numpy as np
import random as rn
from string import ascii_letters

# build a list of 1000 unique random 5-letter words
randomLst = []
while len(randomLst) != 1000:
    candidate = ''.join(rn.choices(ascii_letters, k=5))
    if candidate not in randomLst:
        randomLst.append(candidate)

whitelist1 = {*randomLst}
whitelist2 = np.array(sorted(randomLst))  # sorted, as np.searchsorted requires
randomWord = rn.choice(randomLst)
randomWords = set(rn.choices(randomLst, k=100))

def func1():
    # set intersection with a one-element set
    return {randomWord}.intersection(whitelist1)

def func2():
    # plain membership test on a set
    return randomWord in whitelist1

def func3():
    # caution: np.isin treats a set as a single object, not a collection,
    # so this is not a true membership test; a list or array would be
    return np.isin('foo', whitelist1)

def func4():
    # binary search; only valid on a sorted array
    return whitelist2[np.searchsorted(whitelist2, randomWord)] == randomWord

def func5():
    # intersect a whole batch of words with the whitelist in one operation
    return randomWords & whitelist1

print("func1=", timeit(func1, number=100000))
print("func2=", timeit(func2, number=100000))
print("func3", timeit(func3, number=100000))
print("func4=", timeit(func4, number=100000))
# number is 1000 here because each call checks up to 100 random words at once,
# so number = 100000/100 = 1000.
print("func5=", timeit(func5, number=1000))
Time taken:
func1= 0.012835499946959317
func2= 0.005004600039683282
func3 0.5219665999757126
func4= 0.19900090002920479
func5= 0.0019264000002294779
Conclusion
If you only want to check one word, the 'in' operator is the fastest.
But if you have a collection of words to check, the '&' operator (func5) is faster.
Note: func5 returns a set containing the words that are in the whitelist.
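As a rough illustration of the two cases in the conclusion, assuming the whitelist is already stored as a set (the variable names below are made up for the example):

```python
whitelist = {"bar", "baz", "x", "y"}

# Single word: `in` on a set is an average O(1) hash lookup.
single_ok = "bar" in whitelist

# Many words: intersect once instead of looping over `in`.
candidates = {"bar", "a", "x", "b"}
allowed = candidates & whitelist    # words that are whitelisted
rejected = candidates - whitelist   # words that are not
```

The set difference is included only to show that the rejected words are available just as cheaply if they are needed.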
CodePudding user response:
whitelist would exist, but if it can possibly be None, coerce it with:
whitelist = whitelist or []
As shared above, you can then just use foo in whitelist to figure out whether it's in the list. This is an O(len(whitelist)) operation. In practice, lists are surprisingly fast (say, even with len(whitelist) around 1,000).
If you need it to be faster, use a set. Optionally, if you need to do n lookups, collect your foos into a set and then use intersection, which is O(n):
foos = { 'bar', 'none' }
whitelist = { 'bar' }
for foo in foos & whitelist:
    print(foo)
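For the case in the question with many whitelists, one possible arrangement is to convert each list to a frozenset once, up front, so that no conversion happens per check. The sketch below assumes the whitelists live in a dict keyed by name; `WHITELISTS` and `filter_allowed` are hypothetical names, and an empty whitelist is treated as "allow everything", as the question describes.

```python
# Hypothetical registry: each set is built once, not on every check.
WHITELISTS = {
    "letters": frozenset(["bar", "baz", "x", "y"]),
    "open": frozenset(),  # empty: no restriction
}

def filter_allowed(foos, whitelist):
    """Return the subset of foos permitted by whitelist (empty permits all)."""
    foos = set(foos)
    return foos if not whitelist else foos & whitelist

letters_result = filter_allowed(["bar", "none", "x"], WHITELISTS["letters"])
open_result = filter_allowed(["bar", "none"], WHITELISTS["open"])
```

Whether this is worth it depends on how often each whitelist is reused; for a whitelist checked only once, building the set costs about as much as a single linear scan.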
CodePudding user response:
Here is a simplified solution. You can do this in two ways:
whitelist = ["bar", "baz", "x", "y"]
foo = "bar"

# method 1
def WhiteListExists(foo, whitelist):
    if whitelist and foo in whitelist:
        return True
    else:
        return False

exists = WhiteListExists(foo, whitelist)

# method 2
exists = True if whitelist and foo in whitelist else False
Both methods do the same thing, but the second one is faster because it avoids a function call.
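A possible third variant, not taken from the answer above: `bool()` collapses the expression to True or False directly. Without it, `whitelist and foo in whitelist` evaluates to the empty list itself when whitelist is empty, which is falsy but is not the value False.

```python
whitelist = ["bar", "baz", "x", "y"]
foo = "bar"

# Same result as method 2, written with bool().
exists = bool(whitelist and foo in whitelist)

# With an empty whitelist, the raw expression yields the list itself.
raw_empty = [] and foo in []
exists_empty = bool(raw_empty)
```

Inside an if statement the conversion is unnecessary, since the condition is evaluated for truthiness anyway; it only matters when the result is stored or returned.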