Home > Blockchain >  find if the value in pandas column is in another column's list
find if the value in pandas column is in another column's list

Time:03-25

I have a df that looks like this:

    ID       dates                              holidays
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]

I need to create another column isHoliday with bool value if the value in column dates is in holidays list for that id. I am trying to use the code below but it shows an error: TypeError: unhashable type: 'list':

df['isHoliday'] = np.where(df['dates'].isin(df['holidays']), True, False)

Is it any way to find whether the columns dates value is in the list that corresponds each ID?

CodePudding user response:

explode works really well here:

df['isHoliday'] = df.eval('dates in holidays.explode()')

Output:

>>> df
    ID       dates                              holidays  isHoliday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]      False

Benchmarks:

def scott():
    [d in l for d, l in zip(df['dates'], df['holidays'])]

def shubham():
    pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

def charles():
    df[['dates','holidays']].apply(lambda row: row['dates'] in row['holidays'], axis=1)

def richardec():
    df.eval('dates in holidays.explode()')

###

%timeit scott()
# 5.99 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit shubham()
# 547 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit charles()
# 525 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit richardec()
# 971 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So mine is the shortest, but Scott's is the fastest!

CodePudding user response:

Here is a vectorized approach:

df['isholiday'] = pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

Result

    ID       dates                                holidays  isholiday
0  ABC  01/01/2022    [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022,  06/01/2022,  27/01/2022]      False

CodePudding user response:

Try this:

df['isholiday'] = [d in l for d, l in zip(df['dates'], df['holidays'])]

Output:

    ID       dates                              holidays  isholiday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 01/27/2022]      False

CodePudding user response:

The source of the issue is that the .isin() method does not vectorize its argument, so it's checking if each date is in the whole series of lists, and in the process I believe it casts the argument to a set for speed purposes. Sets only accept immutable values, and lists are mutable, so it gives the unhashable type error.

I think you'll want to use an apply instead:

df['isHoliday'] = df[['dates','holidays']].apply(
    lambda row: row['dates'] in row['holidays'], 
    axis=1
)

In this case you don't need the np.where because the condition itself returns a series of Trues and Falses.

  • Related