I'm trying to make a smaller Dask DataFrame from this larger one. The condition is that if a suspect has been seen 5 or more times in the last 60 days, I only want to retain those records and discard all the older ones. However, if a suspect has not been seen 5 times yet, I would like to go back another 60 days and check again, repeating until my criterion of at least 5 sightings is met. If there are fewer than 5 sightings for a certain suspect in the whole dataset, then I will just take whatever is there.
This is what I tried:
time_sliced_df = []
df = df.sort_values('suspect')
print(df.groupby('suspect').aggregate())
for counter, row in df.iterrows():
    print('{}/{}'.format(counter, len(df)))
    for days_back in range(0, 1095, 60):
        # keep the row if it falls inside the current lookback window
        time_sliced_df.append(row) if row.date_sighted > (datetime.now() - timedelta(days=days_back)) else None
        if len(time_sliced_df) > 5:
            list_of_recent_suspects.append(time_sliced_df)
            list_of_recent_suspects = list(np.array(list_of_recent_suspects, dtype='object').flatten())
            break
return list_of_recent_suspects
In the past, when working with pandas, I had an over-engineered multiprocessing approach that still took ages. I was really hoping that leveraging Dask would help me get past this, and it has already helped so much that I'm comfortable with it.
I'm not just asking for someone to dish out the code, either; even being told that I should use groupby, or try some sort of Dask/pandas-style query, would be helpful. Here is a sample of the data:
suspect date_sighted
48 ABC 2016-05-12
54 PPPPP 2017-01-07
55 ABC 2017-01-18
57 CFG 2017-01-28
63 PPPPP 2017-03-03
... ... ...
3378 IOI 2019-08-14
3418 ZZZZ 2019-08-14
3472 IOI 2019-08-16
3607 ZZZZ 2019-08-19
3669 CFG 2019-08-20
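For reference, here is a minimal snippet that rebuilds the visible sample rows above as a Dask DataFrame (the partition count is arbitrary, and only the rows shown are included, not the elided ones):

import dask.dataframe as dd
import pandas as pd

sample = pd.DataFrame(
    {
        'suspect': ['ABC', 'PPPPP', 'ABC', 'CFG', 'PPPPP',
                    'IOI', 'ZZZZ', 'IOI', 'ZZZZ', 'CFG'],
        'date_sighted': pd.to_datetime(
            ['2016-05-12', '2017-01-07', '2017-01-18', '2017-01-28', '2017-03-03',
             '2019-08-14', '2019-08-14', '2019-08-16', '2019-08-19', '2019-08-20']),
    },
    index=[48, 54, 55, 57, 63, 3378, 3418, 3472, 3607, 3669],
)
df = dd.from_pandas(sample, npartitions=2)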
Thank you!
CodePudding user response:
import dask.dataframe as dd
import pandas as pd
from datetime import datetime, timedelta

def segregate_by_last_60_days(df: pd.DataFrame) -> pd.DataFrame:
    # groupby().apply() passes each suspect's rows in as a plain pandas
    # DataFrame, so pandas type hints and operations apply here.
    for days_back in range(0, 1095, 60):  # widen the window 60 days at a time
        placeholder_df = df[df.date_sighted > (datetime.now() - timedelta(days=days_back))]
        if len(placeholder_df) >= 5:  # "at least 5 sightings", per the question
            return placeholder_df
    return placeholder_df  # fewer than 5 even in the widest window: take that slice

def collate_dataset(df: dd.DataFrame) -> dd.DataFrame:
    # meta=df.head(0) gives Dask an empty frame with the correct schema
    return df.groupby('suspect').apply(segregate_by_last_60_days, meta=df.head(0))
This was my solution in the end; it turned out to be far more performant than I expected.
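For completeness, a minimal usage sketch; 'sightings.parquet' is a hypothetical path standing in for however the real dataset is loaded:

import dask.dataframe as dd

df = dd.read_parquet('sightings.parquet')  # hypothetical source file

recent = collate_dataset(df)   # still a lazy Dask DataFrame
result = recent.compute()      # materialise as a pandas DataFrame
print(result.groupby('suspect').size())

One reason this performs well: inside groupby().apply(), each group is an ordinary in-memory pandas DataFrame, so the repeated filtering and len() checks in the loop are cheap; only the apply itself is scheduled lazily by Dask.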