I have a df that looks like this:
ID dates holidays
0 ABC 01/01/2022 [12/25/2021, 01/01/2022, 01/17/2022]
1 DEF 03/17/2022 [01/01/2022, 06/01/2022, 27/01/2022]
I need to create another column isHoliday
with bool value if the value in column dates is in holidays list for that id. I am trying to use the code below but it shows an error: TypeError: unhashable type: 'list'
:
df['isHoliday'] = np.where(df['dates'].isin(df['holidays']), True, False)
Is it any way to find whether the columns dates value is in the list that corresponds each ID?
CodePudding user response:
explode
works really well here:
df['isHoliday'] = df.eval('dates in holidays.explode()')
Output:
>>> df
ID dates holidays isHoliday
0 ABC 01/01/2022 [12/25/2021, 01/01/2022, 01/17/2022] True
1 DEF 03/17/2022 [01/01/2022, 06/01/2022, 27/01/2022] False
Benchmarks:
def scott():
[d in l for d, l in zip(df['dates'], df['holidays'])]
def shubham():
pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)
def charles():
df[['dates','holidays']].apply(lambda row: row['dates'] in row['holidays'], axis=1)
def richardec():
df.eval('dates in holidays.explode()')
###
%timeit scott()
# 5.99 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit shubham()
# 547 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit charles()
# 525 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit richardec()
# 971 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
So mine is the shortest, but Scott's is the fastest!
CodePudding user response:
Here is a vectorized approach:
df['isholiday'] = pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)
Result
ID dates holidays isholiday
0 ABC 01/01/2022 [12/25/2021, 01/01/2022, 01/17/2022] True
1 DEF 03/17/2022 [01/01/2022, 06/01/2022, 27/01/2022] False
CodePudding user response:
Try this:
df['isholiday'] = [d in l for d, l in zip(df['dates'], df['holidays'])]
Output:
ID dates holidays isholiday
0 ABC 01/01/2022 [12/25/2021, 01/01/2022, 01/17/2022] True
1 DEF 03/17/2022 [01/01/2022, 06/01/2022, 01/27/2022] False
CodePudding user response:
The source of the issue is that the .isin()
method does not vectorize its argument, so it's checking if each date is in the whole series of lists, and in the process I believe it casts the argument to a set for speed purposes. Sets only accept immutable values, and lists are mutable, so it gives the unhashable type error.
I think you'll want to use an apply instead:
df['isHoliday'] = df[['dates','holidays']].apply(
lambda row: row['dates'] in row['holidays'],
axis=1
)
In this case you don't need the np.where because the condition itself returns a series of True
s and False
s.