Check if there are duplicate elements in a flat list contained in a pandas Series
Issue
One column of my dataframe contains lists.
I'm trying to compute a new column (element_unique) returning:
1 if the list contains no duplicate elements
0 if the list contains duplicate elements
I managed to do this by iterating row by row using apply. I would like to know if there is a more efficient way to do this calculation without iterating row by row.
Input data
import pandas as pd
df = pd.DataFrame(data={
'key': ["A", "B", "C", "D"],
'dim': [["3", "1", "2"], [6, 5, 6], ["1"], ["2", "2"]]})
df
#   key              dim
# 0   A  ['3', '1', '2']
# 1   B        [6, 5, 6]
# 2   C            ['1']
# 3   D       ['2', '2']
Calculation
def is_list_of_unique_values(list_):
    # A list is duplicate-free iff its length equals the size of its set.
    if len(list_) == len(set(list_)):
        return 1
    return 0
df.loc[:, "element_unique"] = df["dim"].apply(is_list_of_unique_values)
df
#   key              dim  element_unique
# 0   A  ['3', '1', '2']               1
# 1   B        [6, 5, 6]               0
# 2   C            ['1']               1
# 3   D       ['2', '2']               0
CodePudding user response:
You can use np.vectorize. From the NumPy docs: "The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy."
Code
df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])
Performance Comparison
A 1.68x speed-up on the posted dataframe:
Posted code
%timeit df["element_unique"] = df["dim"].apply(is_list_of_unique_values)
496 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.vectorize
%timeit df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])
295 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
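For larger frames you can also avoid calling a Python function per row entirely. This is a sketch of a pandas-native alternative (assuming pandas >= 0.25, where Series.explode is available), not part of either posted answer: it compares each list's length with its number of distinct elements.
import pandas as pd

# Explode the lists onto the original index, then count the distinct
# elements within each index group.
n_unique = df["dim"].explode().groupby(level=0).nunique()

# A list is duplicate-free iff its raw length equals its distinct count.
# .str.len() returns the length of each list in an object column.
df["element_unique"] = (n_unique == df["dim"].str.len()).astype(int)
Whether this beats np.vectorize depends on the list lengths and row count, so it is worth timing on your actual data.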
CodePudding user response:
You can compare against the length of the set, which by definition contains only unique elements:
df.dim.apply(lambda x: (len(x) == len(set(x))) * 1)
Note that I multiplied the result by the integer 1 so that the output is 1 or 0 instead of boolean values.
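Equivalently (a minor variant, not from the original answer), you can cast the boolean result instead of multiplying:
df["element_unique"] = df.dim.apply(lambda x: len(x) == len(set(x))).astype(int)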