Check if there are duplicate elements in a flat list contained in a pandas Series


Issue

One column of my dataframe contains lists.

I'm trying to compute a new column (element_unique) that returns:

  • 1 if the list contains no duplicate elements
  • 0 if the list contains duplicate elements

I managed to do this by iterating row by row using apply. Is there a more efficient way to do this calculation without iterating row by row?

Input data

import pandas as pd

df = pd.DataFrame(data={
    'key': ["A", "B", "C", "D"],
    'dim': [[3, 1, 2], [6, 5, 6], [1], [2, 2]]})
df
#   key        dim
# 0   A  [3, 1, 2]
# 1   B  [6, 5, 6]
# 2   C        [1]
# 3   D     [2, 2]

Calculation

def is_list_of_unique_values(list_):
    # A set keeps only unique elements, so the lengths match
    # only when the list has no duplicates.
    if len(list_) == len(set(list_)):
        return 1
    return 0

df.loc[:, "element_unique"] = df["dim"].apply(is_list_of_unique_values)

df
#   key        dim  element_unique
# 0   A  [3, 1, 2]               1
# 1   B  [6, 5, 6]               0
# 2   C        [1]               1
# 3   D     [2, 2]               0

CodePudding user response:

You can use np.vectorize

  • The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
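
As a quick aside (not from the original answer), this map-like, broadcasting behaviour can be seen on a toy function:

import numpy as np

# Illustration only: np.vectorize calls the wrapped function element-wise
# and broadcasts its inputs like other NumPy operations.
add = np.vectorize(lambda a, b: a + b)
add([1, 2, 3], 10)          # scalar is broadcast -> array([11, 12, 13])
add([1, 2, 3], [4, 5, 6])   # paired element-wise like map -> array([5, 7, 9])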

Code

df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])

Performance Comparison

A 1.68× speed-up on the posted dataframe

Posted code

%timeit df["element_unique"] = df["dim"].apply(is_list_of_unique_values) 
496 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.vectorize

%timeit df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])
295 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

CodePudding user response:

You can check the length of the set, which contains only unique elements by definition:

df.dim.apply(lambda x: (len(x) == len(set(x))) * 1)

Note that I multiplied the result by the integer 1 so that the output is 1 or 0 instead of boolean values.
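
As a small illustration (not part of the original answer), here is how the set trick and the multiplication by 1 behave on one of the lists above:

# Illustration only: a set drops duplicates, so comparing lengths detects them,
# and multiplying the boolean by 1 coerces True/False to 1/0.
x = [6, 5, 6]
len(set(x))                   # 2 -> one duplicate was removed
len(x) == len(set(x))         # False
(len(x) == len(set(x))) * 1   # 0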
