We have a pandas DataFrame df and a set of values set_vals.
For a particular column (let's say 'name'), I would now like to compute a new column which is True whenever the value of df['name'] is in set_vals and False otherwise.
One way to do this is to write:
df['name'].apply(lambda x : x in set_vals)
but when both df and set_vals become large this method is very slow. Is there a more efficient way of creating this new column?
CodePudding user response:
The real problem is that the complexity of df['name'].apply(lambda x : x in set_vals) is O(M*N), where M is the length of df and N is the length of set_vals, if set_vals is a list (or another type for which the search complexity is linear).
The complexity can be improved to O(M) if set_vals is stored in a hash-based container (a set, or the keys of a dict), because each membership test is then O(1) on average.
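As a minimal sketch of the point above: the toy DataFrame and the column names in_set_apply and in_set are made up for illustration. It shows the original apply call running against a real set, and pandas' built-in Series.isin, which does the same membership test without a Python-level lambda call per row.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave"]})
vals_list = ["bob", "dave", "erin"]

# Convert once to a set so each membership test is a hash lookup.
set_vals = set(vals_list)

# Same approach as in the question, now with O(1) average lookups.
df["in_set_apply"] = df["name"].apply(lambda x: x in set_vals)

# Vectorized alternative: Series.isin returns a boolean Series.
df["in_set"] = df["name"].isin(set_vals)

print(df)
```

Both columns hold the same booleans. Which variant is faster on real data is worth timing rather than assuming.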
CodePudding user response:
It is a complex problem with a simple solution: you can try running the loop in multiple threads, each handling one chunk of the rows,
let's say [0:i], [i+1:j], [j+1:k],
etc.
Here is a very good explanation of how to do multiple threads
Also, if you are interested in more details about performance and efficiency check this out.
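The chunking idea could be sketched roughly as follows. The helper name membership_in_chunks and the n_chunks parameter are hypothetical, and the chunks are half-open slices so that no row is skipped. In CPython the global interpreter lock may leave little or no speedup from threads for this kind of work, so treat it as something to measure, not as a guaranteed improvement.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def membership_in_chunks(series, values, n_chunks=4):
    """Test membership chunk by chunk in a thread pool (hypothetical helper)."""
    values = set(values)  # hash-based lookups
    size = max(1, -(-len(series) // n_chunks))  # ceiling division
    # Half-open slices [start, start + size) cover every row exactly once.
    chunks = [series.iloc[start:start + size]
              for start in range(0, len(series), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(lambda chunk: chunk.isin(values), chunks))
    if not parts:
        # Empty input: return an empty boolean Series.
        return pd.Series(dtype=bool, index=series.index)
    # pool.map preserves order, so the result lines up with the input index.
    return pd.concat(parts)


names = pd.Series(["a", "b", "c", "d", "e"])
result = membership_in_chunks(names, ["b", "e"], n_chunks=2)
print(result)
```

The result keeps the original index, so it can be assigned straight back, e.g. df['new_col'] = membership_in_chunks(df['name'], set_vals).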