python - How can I use a dictionary as multiple conditions in pandas.query()?

I have a workers dictionary with names as keys and lists of two dates as values; each pair of dates defines a time interval:

workers = {
    'Alan': [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    'Ben': [dt.date(2018, 3, 28), dt.date(2021, 5, 7)]
}

I also have a df dataframe of events, each of which is related to a worker:

event_id    person  event_date
1LFDVDX     Alan    2018-10-28
4DLDQVC     Ben     2022-02-01
5PEXVGH     Ben     2019-09-05
9OPCLXD     John    2020-06-15

So I'm trying to filter df with query(), using the dictionary as a set of conditions. Specifically, I want to get from df all events for Alan with event_date between 2017-01-15 and 2020-05-04, and all events for Ben with event_date between 2018-03-28 and 2021-05-07. The expected result is:

event_id    person  event_date
1LFDVDX     Alan    2018-10-28
5PEXVGH     Ben     2019-09-05

I tried the following:

df.query('person in @workers and event_date >= @workers[person][0] and event_date <= @workers[person][1]')

But I got TypeError: unhashable type: 'Series'

How can I solve this? Thank you in advance.

CodePudding user response：

Problem

The problem is in this expression:

person in @workers

Here person is a Series, which is not hashable, yet the expression tries to look it up among the keys of the dictionary. You can solve this particular problem with isin, but that only answers part of the question.
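As a minimal sketch of the difference (the small Series below is made up for illustration, and only the dictionary keys matter here): using a whole Series as a dictionary key requires hashing it and fails, whereas isin tests each element against the keys and returns a boolean mask.

```python
import pandas as pd

# Only the keys matter for this illustration, so the values are placeholders.
workers = {"Alan": None, "Ben": None}
person = pd.Series(["Alan", "Ben", "Ben", "John"], name="person")

error = None
try:
    # A dictionary lookup hashes its key, and a Series cannot be hashed.
    workers[person]
except TypeError as exc:
    error = exc

# isin works element by element and yields one boolean per row.
mask = person.isin(list(workers))

print(type(error).__name__)
print(mask.tolist())
```

The mask only says whether a person appears in the dictionary; the date interval still has to be checked separately.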

Solution

I suggest you do this instead:

from collections import defaultdict

import pandas as pd

# wrap workers in a defaultdict so that map returns an empty list (instead of NaN) for people who are not in the dictionary
lookup = defaultdict(list, workers)

# create new Series applying the lookup dictionary
person_map = df["person"].map(lookup)

# convert the series to a DataFrame with the same index as df
ran = pd.DataFrame(data=person_map.to_list(), columns=["start", "end"], index=person_map.index)

# use a very clean query (inclusive bounds, matching the >= and <= in the question)
res = df.query("person.isin(@workers) and @ran.start <= event_date <= @ran.end")

print(res)

Output

  event_id person event_date
0  1LFDVDX   Alan 2018-10-28
2  5PEXVGH    Ben 2019-09-05
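
For reference, here is a self-contained variant of the same idea that can be run as is. It is a sketch under two assumptions that the question does not confirm: event_date is a datetime64 column, and the interval bounds are inclusive. It maps the start and end dates separately (starts, ends, start and end are illustrative names) and converts them with pd.to_datetime, so people missing from workers get NaT and never match.

```python
import datetime as dt

import pandas as pd

workers = {
    "Alan": [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    "Ben": [dt.date(2018, 3, 28), dt.date(2021, 5, 7)],
}
df = pd.DataFrame(
    {
        "event_id": ["1LFDVDX", "4DLDQVC", "5PEXVGH", "9OPCLXD"],
        "person": ["Alan", "Ben", "Ben", "John"],
        # Assumption: event dates are stored as datetime64 values.
        "event_date": pd.to_datetime(
            ["2018-10-28", "2022-02-01", "2019-09-05", "2020-06-15"]
        ),
    }
)

# Split the dictionary into one lookup for the start and one for the end.
starts = {name: span[0] for name, span in workers.items()}
ends = {name: span[1] for name, span in workers.items()}

# Per-row bounds; people without an entry map to NaN and become NaT.
start = pd.to_datetime(df["person"].map(starts))
end = pd.to_datetime(df["person"].map(ends))

# Comparisons against NaT are False, so unknown people drop out.
res = df.query("@start <= event_date <= @end")
print(res)
```

Because unmatched rows carry NaT bounds, the separate isin condition is not needed in this variant.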

CodePudding user response：

I would add temporary filter columns with join and query on that:

df_workers = pd.DataFrame(workers).T.rename(columns={0: 'start', 1: 'end'})
df2 = df.join(df_workers, on='person')
df2.query('event_date >= start and event_date <= end')[df.columns]
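
The comparison in that query relies on event_date and the start/end columns holding comparable types. A hedged sketch, assuming event_date is a datetime64 column, converts the bounds with pd.to_datetime before joining. Here bounds is an illustrative name, and DataFrame.from_dict is swapped in for the transpose-and-rename step:

```python
import datetime as dt

import pandas as pd

workers = {
    "Alan": [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    "Ben": [dt.date(2018, 3, 28), dt.date(2021, 5, 7)],
}
df = pd.DataFrame(
    {
        "event_id": ["1LFDVDX", "4DLDQVC", "5PEXVGH", "9OPCLXD"],
        "person": ["Alan", "Ben", "Ben", "John"],
        # Assumption: event dates are stored as datetime64 values.
        "event_date": pd.to_datetime(
            ["2018-10-28", "2022-02-01", "2019-09-05", "2020-06-15"]
        ),
    }
)

# One row per worker, with both bounds converted to datetime64.
bounds = pd.DataFrame.from_dict(
    workers, orient="index", columns=["start", "end"]
).apply(pd.to_datetime)

# Left join on the person column; people without bounds get NaT.
df2 = df.join(bounds, on="person")

# Keep rows inside the interval, then drop the temporary columns.
res = df2.query("start <= event_date <= end")[df.columns]
print(res)
```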

If your filter dict is small, you can also compose a full explicit query:

query = ' or '.join([
    f"(person == '{k}' and event_date >= '{v[0]}' and event_date <= '{v[1]}')"
    for k, v in workers.items()
])
df.query(query)
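
To make the composed string concrete, here is a sketch that builds it for the workers dictionary from the question and runs it on the sample data. It assumes event_date is a datetime64 column, so that query can compare it with the quoted date strings:

```python
import datetime as dt

import pandas as pd

workers = {
    "Alan": [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    "Ben": [dt.date(2018, 3, 28), dt.date(2021, 5, 7)],
}
df = pd.DataFrame(
    {
        "event_id": ["1LFDVDX", "4DLDQVC", "5PEXVGH", "9OPCLXD"],
        "person": ["Alan", "Ben", "Ben", "John"],
        # Assumption: event dates are stored as datetime64 values.
        "event_date": pd.to_datetime(
            ["2018-10-28", "2022-02-01", "2019-09-05", "2020-06-15"]
        ),
    }
)

# One parenthesised clause per worker, joined with "or".
query = " or ".join(
    f"(person == '{k}' and event_date >= '{v[0]}' and event_date <= '{v[1]}')"
    for k, v in workers.items()
)
print(query)

res = df.query(query)
print(res)
```

Because the names are interpolated directly into the string, this only suits keys that contain no quote characters.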

CodePudding user response：

I believe that there's hardly a way to avoid a loop here. I suggest the following code:

import datetime as dt
import pandas as pd

workers = {
    "Alan": [dt.date(2017, 1, 15), dt.date(2020, 5, 4)],
    "Ben": [dt.date(2018, 3, 28), dt.date(2021, 5, 7)],
}
event_df = pd.DataFrame(
    {
        "event_id": ["1LFDVDX", "4DLDQVC", "5PEXVGH", "9OPCLXD"],
        "person": ["Alan", "Ben", "Ben", "John"],
        "event_date": [
            dt.date(2018, 10, 28),
            dt.date(2022, 2, 1),
            dt.date(2019, 9, 5),
            dt.date(2020, 6, 15),
        ],
    }
)

# collect the filtered rows for each worker, then combine them once at the end
filtered_parts = []

for worker, (worker_start, worker_end) in workers.items():
    # keep only this worker's events
    df_worker = event_df.query("person == @worker")
    # keep only the events inside the worker's interval (bounds included)
    df_worker = df_worker.query("@worker_start <= event_date <= @worker_end")
    filtered_parts.append(df_worker)

df_filtered = pd.concat(filtered_parts, axis=0)

Output:

  event_id person  event_date
  1LFDVDX   Alan  2018-10-28
  5PEXVGH    Ben  2019-09-05