I'm trying to make a smaller Dask DataFrame from this larger one. The condition is that if a suspect has been seen 5 or more times in the last 60 days, I only want to retain those records and discard all the older ones. However, if a suspect has not been seen 5 times yet, I would like to go back another 60 days and check again, repeating until my criterion of at least 5 sightings is met. If there are fewer than 5 sightings for a certain suspect in the whole dataset, then I will just take whatever is there.
This is what I tried:
time_sliced_df = []
df = df.sort_values('suspect')
print(df.groupby('suspect').aggregate())
for counter, row in df.iterrows():
    print('{}/{}'.format(counter, len(df)))
    for days_back in range(0, 1095, 60):
        # keep the row if it falls inside the current lookback window
        time_sliced_df.append(row) if row.date_sighted > (datetime.now() - timedelta(days=days_back)) else None
        if len(time_sliced_df) > 5:
            list_of_recent_suspects.append(time_sliced_df)
            list_of_recent_suspects = list(np.array(list_of_recent_suspects, dtype='object').flatten())
            break
return list_of_recent_suspects
In the past, when working with pandas, I had an over-engineered multiprocessing approach that still took ages. I was really hoping that leveraging Dask would help me get past this, and it has already helped so much that I'm comfortable with it.
I'm not just asking for someone to dish out the code, either; even being told that I should use groupby, or try some sort of Dask/pandas-style query, would be helpful. Here is a sample of the data:
suspect date_sighted
48 ABC 2016-05-12
54 PPPPP 2017-01-07
55 ABC 2017-01-18
57 CFG 2017-01-28
63 PPPPP 2017-03-03
... ... ...
3378 IOI 2019-08-14
3418 ZZZZ 2019-08-14
3472 IOI 2019-08-16
3607 ZZZZ 2019-08-19
3669 CFG 2019-08-20
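For reference, here is a minimal snippet that rebuilds the visible sample rows above as a Dask DataFrame (the partition count is arbitrary, and only the rows shown are included, not the elided ones):

import dask.dataframe as dd
import pandas as pd

sample = pd.DataFrame(
    {
        'suspect': ['ABC', 'PPPPP', 'ABC', 'CFG', 'PPPPP',
                    'IOI', 'ZZZZ', 'IOI', 'ZZZZ', 'CFG'],
        'date_sighted': pd.to_datetime(
            ['2016-05-12', '2017-01-07', '2017-01-18', '2017-01-28', '2017-03-03',
             '2019-08-14', '2019-08-14', '2019-08-16', '2019-08-19', '2019-08-20']),
    },
    index=[48, 54, 55, 57, 63, 3378, 3418, 3472, 3607, 3669],
)
df = dd.from_pandas(sample, npartitions=2)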
Thank you!
CodePudding user response:
import dask.dataframe as dd
import pandas as pd
from datetime import datetime, timedelta

def segregate_by_last_60_days(df: pd.DataFrame) -> pd.DataFrame:
    # groupby().apply() passes each suspect's rows in as a plain pandas
    # DataFrame, so pandas type hints and operations apply here.
    for days_back in range(0, 1095, 60):  # widen the window 60 days at a time
        placeholder_df = df[df.date_sighted > (datetime.now() - timedelta(days=days_back))]
        if len(placeholder_df) >= 5:  # "at least 5 sightings", per the question
            return placeholder_df
    return placeholder_df  # fewer than 5 even in the widest window: take that slice

def collate_dataset(df: dd.DataFrame) -> dd.DataFrame:
    # meta=df.head(0) gives Dask an empty frame with the correct schema
    return df.groupby('suspect').apply(segregate_by_last_60_days, meta=df.head(0))
This was my solution in the end; it turned out to be far more performant than I expected.
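For completeness, a minimal usage sketch; 'sightings.parquet' is a hypothetical path standing in for however the real dataset is loaded:

import dask.dataframe as dd

df = dd.read_parquet('sightings.parquet')  # hypothetical source file

recent = collate_dataset(df)   # still a lazy Dask DataFrame
result = recent.compute()      # materialise as a pandas DataFrame
print(result.groupby('suspect').size())

One reason this performs well: inside groupby().apply(), each group is an ordinary in-memory pandas DataFrame, so the repeated filtering and len() checks in the loop are cheap; only the apply itself is scheduled lazily by Dask.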