I have a dataframe containing 500k records, and I would like to group by multiple columns (of string and date types) and then pick only a few records inside each group based on a custom condition.
Basically, I need to group the records (by `first_roll_up`, `date`, `granular_timestamp`) and check whether the group contains any value in the `top` column. If it does, keep only the record(s) with a `top` value; if the group has no record with a `top` value, keep all of its records.
Input:
first_roll_up | sub | top | date | granular_timestamp | values
---|---|---|---|---|---
ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_1 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_2 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_3 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | .
Output:
first_roll_up | sub | top | date | granular_timestamp | values
---|---|---|---|---|---
ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | .
XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | .
I tried the below, but the function takes 10 minutes to complete. I also tried `transform` instead of `apply`, adding a new boolean column to identify the groups, but that didn't help either.

```python
(df.groupby(['first_roll_up', 'sub', 'top', 'date', 'granular_timestamp'], sort=False)
   .apply(custom_function_to_filter_each_group_records))
```
CodePudding user response:
It doesn't seem like you need to use `groupby()` at all here, but rather some filtering and concatenating.

First, we'll create a dataframe which identifies the rows that have a `top` value of `T1`:

```python
df_t1 = df[df['top'] == 'T1']
```
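(If `top` can hold values other than `T1`, the same idea generalizes. A sketch, assuming a missing `top` is stored as NaN; `df_has_top` is a hypothetical name:)

```python
# Hypothetical generalization: rows carrying any non-missing top value,
# not just the literal 'T1'
df_has_top = df[df['top'].notna()]
```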
We now want to find all rows that do not have any matching `T1` value, joining by `first_roll_up`. This seems to only occur when `sub` is missing, which makes your data appear to be hierarchical (e.g. `first_roll_up` --> `sub`).

```python
df_want = pd.merge(df, df_t1, on='first_roll_up', how='outer', indicator=True, suffixes=('', '_drop')) \
    .query("_merge == 'left_only'") \
    .drop(columns='_merge')
```
This gives us a dataframe that looks like this:
```text
  first_roll_up      sub  top  ... date_drop granular_timestamp_drop values_drop
4           XYZ  SUB_X_1  NaN  ...       NaN                     NaN         NaN
5           XYZ  SUB_X_2  NaN  ...       NaN                     NaN         NaN
6           XYZ  SUB_Y_1  NaN  ...       NaN                     NaN         NaN
```
Because of the way we joined, we have a set of columns that we don't care about. We've suffixed them with `_drop`; we'll remove these later.
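As an aside, the same anti-join can be written with `isin`, which skips the suffixed columns entirely (a sketch; `df_no_t1` is a hypothetical name):

```python
# Keep only the rows whose first_roll_up never appears in df_t1
df_no_t1 = df[~df['first_roll_up'].isin(df_t1['first_roll_up'])]
```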
Now we'll just stack our `T1` values on top of the dataframe we just created and drop the columns we don't want:
```python
df_want = pd.concat([df_t1, df_want.loc[:, ~df_want.columns.str.endswith('_drop')]], ignore_index=True)
```
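Note that `ignore_index=True` gives the result a fresh 0..n-1 index; leave it out if you want to keep the original row labels.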
Final output:
```text
  first_roll_up      sub  top       date       granular_timestamp values
0           ABC      NaN   T1  2/10/2022  2/10/2022 10:00:00:000       .
1           XYZ  SUB_X_1  NaN  2/12/2022  2/10/2022 11:00:00:000       .
2           XYZ  SUB_X_2  NaN  2/12/2022  2/10/2022 11:00:00:000       .
3           XYZ  SUB_Y_1  NaN  2/12/2022  2/10/2022 12:00:00:000       .
```
CodePudding user response:
Another possible solution:
```python
(df.groupby(['first_roll_up', 'date', 'granular_timestamp'], as_index=False)
   .apply(lambda g: g if g.top.isna().all() else g.loc[g.top.notna()])
   .reset_index(drop=True))
```
Output:
```text
  first_roll_up      sub  top       date       granular_timestamp values
0           ABC      NaN   T1  2/10/2022  2/10/2022 10:00:00:000       .
1           XYZ  SUB_X_1  NaN  2/12/2022  2/10/2022 11:00:00:000       .
2           XYZ  SUB_X_2  NaN  2/12/2022  2/10/2022 11:00:00:000       .
3           XYZ  SUB_Y_1  NaN  2/12/2022  2/10/2022 12:00:00:000       .
```
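Since `apply` runs a Python function per group and can still be slow on 500k rows, the same rule can also be expressed as a vectorized mask. A sketch, assuming a missing `top` is stored as NaN:

```python
# True for rows that carry a top value
has_top = df['top'].notna()

# For each (first_roll_up, date, granular_timestamp) group, True on every
# row if the group contains at least one row with a top value
group_has_top = has_top.groupby(
    [df['first_roll_up'], df['date'], df['granular_timestamp']]
).transform('any')

# Keep the top rows from groups that have one, and all rows from groups that don't
out = df[has_top | ~group_has_top].reset_index(drop=True)
```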