Check if there are duplicate elements in a flat list contained in a pandas Series


Issue

One column of my dataframe contains lists.

I'm trying to compute a new column (element_unique) that returns:

  • 1 if the list contains no duplicate elements
  • 0 if the list contains duplicate elements

I managed to do this by iterating row by row using apply. Is there a more efficient way to do this calculation without iterating row by row?

Input data

import pandas as pd

df = pd.DataFrame(data={
    'key': ["A", "B", "C", "D"],
    'dim': [[3, 1, 2], [6, 5, 6], [1], [2, 2]]})
df
#   key        dim
# 0   A  [3, 1, 2]
# 1   B  [6, 5, 6]
# 2   C        [1]
# 3   D     [2, 2]

Calculation

def is_list_of_unique_values(list_):
    # A set keeps only unique elements, so the lengths match
    # only when the list has no duplicates.
    if len(list_) == len(set(list_)):
        return 1
    return 0

df.loc[:, "element_unique"] = df["dim"].apply(is_list_of_unique_values)

df
#   key        dim  element_unique
# 0   A  [3, 1, 2]               1
# 1   B  [6, 5, 6]               0
# 2   C        [1]               1
# 3   D     [2, 2]               0

CodePudding user response:

You can use np.vectorize

  • The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
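
As a quick aside (not from the original answer), this map-like, broadcasting behaviour can be seen on a toy function:

import numpy as np

# Illustration only: np.vectorize calls the wrapped function element-wise
# and broadcasts its inputs like other NumPy operations.
add = np.vectorize(lambda a, b: a + b)
add([1, 2, 3], 10)          # scalar is broadcast -> array([11, 12, 13])
add([1, 2, 3], [4, 5, 6])   # paired element-wise like map -> array([5, 7, 9])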

Code

df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])

Performance Comparison

A 1.68× speed-up on the posted dataframe

Posted code

%timeit df["element_unique"] = df["dim"].apply(is_list_of_unique_values) 
496 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.vectorize

%timeit df["element_unique"] = np.vectorize(is_list_of_unique_values)(df["dim"])
295 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

CodePudding user response:

You can check the length of the set, which contains only unique elements by definition:

df.dim.apply(lambda x: (len(x) == len(set(x))) * 1)

Note that I multiplied the result by the integer 1 so that the output is 1 or 0 instead of boolean values.
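
As a small illustration (not part of the original answer), here is how the set trick and the multiplication by 1 behave on one of the lists above:

# Illustration only: a set drops duplicates, so comparing lengths detects them,
# and multiplying the boolean by 1 coerces True/False to 1/0.
x = [6, 5, 6]
len(set(x))                   # 2 -> one duplicate was removed
len(x) == len(set(x))         # False
(len(x) == len(set(x))) * 1   # 0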
