I have a dataframe containing 500k records, and I would like to group by multiple columns (of string and date types) and then pick only a few records inside each group based on a custom condition.
Basically, I need to group the records (by `first_roll_up`, `date`, `granular_timestamp`) and check whether the group contains any value in the `top` column. If it does, keep only the record(s) with a `top` value; if the group has no record with a `top` value, keep all of its records.
Input:
first_roll_up | sub | top | date | granular_timestamp | values
---|---|---|---|---|---
ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_1 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_2 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
ABC | SUB_A_3 | | 2/10/2022 | 2/10/2022 10:00:00:000 | .
XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | .
Output:
first_roll_up | sub | top | date | granular_timestamp | values
---|---|---|---|---|---
ABC | | T1 | 2/10/2022 | 2/10/2022 10:00:00:000 | .
XYZ | SUB_X_1 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_X_2 | | 2/12/2022 | 2/10/2022 11:00:00:000 | .
XYZ | SUB_Y_1 | | 2/12/2022 | 2/10/2022 12:00:00:000 | .
I tried the below, but the function takes 10 minutes to complete. I also tried `transform` instead of `apply`, adding a new boolean column to identify the groups, but that didn't help either.

```python
(df.groupby(['first_roll_up', 'sub', 'top', 'date', 'granular_timestamp'], sort=False)
   .apply(custom_function_to_filter_each_group_records))
```
CodePudding user response:
It doesn't seem like you need to use `groupby()` at all here, but rather some filtering and concatenating.

First, we'll create a dataframe which identifies the rows that have a `top` value of `T1`:

```python
df_t1 = df[df['top'] == 'T1']
```
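(If `top` can hold values other than `T1`, the same idea generalizes. A sketch, assuming a missing `top` is stored as NaN; `df_has_top` is a hypothetical name:)

```python
# Hypothetical generalization: rows carrying any non-missing top value,
# not just the literal 'T1'
df_has_top = df[df['top'].notna()]
```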
We now want to find all rows that do not have any matching `T1` value, joining by `first_roll_up`. This seems to only occur when `sub` is missing, which makes your data appear to be hierarchical (e.g. `first_roll_up` --> `sub`).

```python
df_want = pd.merge(df, df_t1, on='first_roll_up', how='outer', indicator=True, suffixes=('', '_drop')) \
    .query("_merge == 'left_only'") \
    .drop(columns='_merge')
```
This gives us a dataframe that looks like this:
```text
  first_roll_up      sub  top  ... date_drop granular_timestamp_drop values_drop
4           XYZ  SUB_X_1  NaN  ...       NaN                     NaN         NaN
5           XYZ  SUB_X_2  NaN  ...       NaN                     NaN         NaN
6           XYZ  SUB_Y_1  NaN  ...       NaN                     NaN         NaN
```
Because of the way we joined, we have a set of columns that we don't care about. We've suffixed them with `_drop`; we'll remove these later.
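As an aside, the same anti-join can be written with `isin`, which skips the suffixed columns entirely (a sketch; `df_no_t1` is a hypothetical name):

```python
# Keep only the rows whose first_roll_up never appears in df_t1
df_no_t1 = df[~df['first_roll_up'].isin(df_t1['first_roll_up'])]
```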
Now we'll just stack our `T1` values on top of the dataframe we just created and drop the columns we don't want:
```python
df_want = pd.concat([df_t1, df_want.loc[:, ~df_want.columns.str.endswith('_drop')]], ignore_index=True)
```
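Note that `ignore_index=True` gives the result a fresh 0..n-1 index; leave it out if you want to keep the original row labels.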
Final output:
```text
  first_roll_up      sub  top       date       granular_timestamp values
0           ABC      NaN   T1  2/10/2022  2/10/2022 10:00:00:000       .
1           XYZ  SUB_X_1  NaN  2/12/2022  2/10/2022 11:00:00:000       .
2           XYZ  SUB_X_2  NaN  2/12/2022  2/10/2022 11:00:00:000       .
3           XYZ  SUB_Y_1  NaN  2/12/2022  2/10/2022 12:00:00:000       .
```
CodePudding user response:
Another possible solution:
```python
(df.groupby(['first_roll_up', 'date', 'granular_timestamp'], as_index=False)
   .apply(lambda g: g if g.top.isna().all() else g.loc[g.top.notna()])
   .reset_index(drop=True))
```
Output:
```text
  first_roll_up      sub  top       date       granular_timestamp values
0           ABC      NaN   T1  2/10/2022  2/10/2022 10:00:00:000       .
1           XYZ  SUB_X_1  NaN  2/12/2022  2/10/2022 11:00:00:000       .
2           XYZ  SUB_X_2  NaN  2/12/2022  2/10/2022 11:00:00:000       .
3           XYZ  SUB_Y_1  NaN  2/12/2022  2/10/2022 12:00:00:000       .
```
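Since `apply` runs a Python function per group and can still be slow on 500k rows, the same rule can also be expressed as a vectorized mask. A sketch, assuming a missing `top` is stored as NaN:

```python
# True for rows that carry a top value
has_top = df['top'].notna()

# For each (first_roll_up, date, granular_timestamp) group, True on every
# row if the group contains at least one row with a top value
group_has_top = has_top.groupby(
    [df['first_roll_up'], df['date'], df['granular_timestamp']]
).transform('any')

# Keep the top rows from groups that have one, and all rows from groups that don't
out = df[has_top | ~group_has_top].reset_index(drop=True)
```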