I have workers dictionary with names as keys and lists of two dates as values, that form some time interval:
workers = {
'Alan': [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
'Ben': [dt.date(2018, 3, 28), dt.date(2021, 5, 7)]
}
And also I have df dataframe with some events, which one is related to some worker:
event_id person event_date
1LFDVDX Alan 2018-10-28
4DLDQVC Ben 2022-02-01
5PEXVGH Ben 2019-09-05
9OPCLXD John 2020-06-15
So I'm trying to filter df with query() using dictionary as multiple condition. Particularly in described case I want to get from df all events for Alan with event_time
between 2017-01-15 and 2020-05-04 and also all events for Ben with event_time
between 2018-03-28 and 2021-05-07. So expected result should be as below:
event_id worker event_date
1LFDVDX Alan 2018-10-28
5PEXVGH Ben 2019-09-05
I was trying next one:
df.query('person in @workers and event_date >= @workers[person][0] and event_date <= @workers[person][1]')
But got TypeError: unhashable type: 'Series'
How can I solve this? Thank you in advance.
CodePudding user response:
Problem
The problem is in this expression:
person in @workers
Here person
is a Series
that is a not hashable object, which is being searched in the keys of the dictionary. To solve this particular problem you use isin
, but that only answers the question partially.
Solution
I suggest you do this instead:
from collections import defaultdict
# create a defaultdict to be used in map, is useful for given a default values to NaN
lookup = defaultdict(list, workers)
# create new Series applying the lookup dictionary
person_map = df["person"].map(lookup)
# convert the series to a DataFrame with the same index as df
ran = pd.DataFrame(data=person_map.to_list(), columns=["start", "end"], index=person_map.index)
# use a very clean query
res = df.query("person.isin(@workers) and @ran.start < event_date < @ran.end")
print(res)
Output
event_id person event_date
0 1LFDVDX Alan 2018-10-28
2 5PEXVGH Ben 2019-09-05
CodePudding user response:
I would add temporary filter columns with join and query on that:
df_workers = pd.DataFrame(workers).T.rename(columns={0: 'start', 1: 'end'})
df2 = df.join(df_workers, on='person')
df2.query('event_date >= start and event_date <= end')[df.columns]
If your filter dict is small, you can also compose a full explicit query:
query = ' or '.join([
f"(person == '{k}' and event_date >= '{v[0]}' and event_date <= '{v[1]}')"
for k, v in workers.items()
])
df.query(query)
CodePudding user response:
I believe that there's hardly a way to avoid a loop here. I suggest the following code:
import datetime as dt
import pandas as pd
workers = {
"Alan": [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
"Ben": [dt.date(2018, 3, 28), dt.date(2021, 5, 7)],
}
event_df = pd.DataFrame(
{
"event_id": ["1LFDVDX", "4DLDQVC", "5PEXVGH", "9OPCLXD"],
"person": ["Alan", "Ben", "Ben", "John"],
"event_data": [
dt.date(2018, 10, 28),
dt.date(2022, 2, 21),
dt.date(2019, 9, 5),
dt.date(2020, 6, 15),
],
}
)
df_filtered = pd.DataFrame()
for worker in workers:
df_worker = event_df.query("person == @worker")
worker_start = workers[worker][0]
worker_end = workers[worker][1]
df_worker = df_worker.query("@worker_start <= event_data <= @worker_end")
df_filtered = pd.concat((df_filtered, df_worker), axis=0)
Output:
event_id person event_data
1LFDVDX Alan 2018-10-28
5PEXVGH Ben 2019-09-05