Pandas - Group / Aggregate rows based on duplication AND the existence of an opposite-CodePudding

I have a Dataframe that sometimes contains 2 rows for what is, in reality, one entry. The way to identify these is:

Columns: Not, Strike, Cents, SD, ED are identical
Column ExecutionTimestamp is going to be within a short period of time (<2.5min)
For a ( ) in Structure, there exists an offsetting (-)

	Not	Strike	Cents	SD	ED	Structure	StartDate	EndDate	Index	DisseminationID	ExecutionTimestamp	EventTimestamp	FloatingRateResetFrequencyPeriod1	TruncatedNotional	TDateID
10	500	2.925	182.5	2	3	( )	2024-06-30 00:00:00	2025-06-30 00:00:00	USD	372521736	2022-06-28 13:41:01	2022-06-28 13:41:01	3M	False	1
11	500	2.925	182.5	2	3	( )	2024-06-30 00:00:00	2025-06-30 00:00:00	USD	372530892	2022-06-28 13:41:01	2022-06-28 13:41:01	3M	False	1
12	250	3.255	153.5	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372539195	2022-06-28 14:36:15	2022-06-28 14:36:15	3M	False	1
13	250	3.255	153.5	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372533865	2022-06-28 14:37:11	2022-06-28 14:37:11	3M	False	1
14	380	2.473	43	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372657384	2022-06-28 16:40:37	2022-06-28 16:40:37	3M	False	1
15	380	2.473	43	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372657386	2022-06-28 16:40:37	2022-06-28 16:40:37	3M	False	1
16	130	3.223	76	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372656023	2022-06-28 16:41:00	2022-06-28 16:41:00	3M	False	1
17	130	3.223	76	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372656033	2022-06-28 16:41:00	2022-06-28 16:41:00	3M	False	1
18	130	3.223	76	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372654200	2022-06-28 16:41:16	2022-06-28 16:41:16	3M	False	1
19	130	3.223	76	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372654203	2022-06-28 16:41:16	2022-06-28 16:41:16	3M	False	1

I am looking to combine these pairs of 2 whereby it will result in:

Columns Not, Strike, SD, ED stay the same.
Column Cents is added together
Column Structure is replaced with (=)
I dont care which timestamp is returned, whatever easy.

	Not	Strike	Cents	SD	ED	Structure	StartDate	EndDate	Index	DisseminationID	ExecutionTimestamp	EventTimestamp	FloatingRateResetFrequencyPeriod1	TruncatedNotional	TDateID
10	500	2.925	182.5	2	3	( )	2024-06-30 00:00:00	2025-06-30 00:00:00	USD	372521736	2022-06-28 13:41:01	2022-06-28 13:41:01	3M	False	1
11	500	2.925	182.5	2	3	( )	2024-06-30 00:00:00	2025-06-30 00:00:00	USD	372530892	2022-06-28 13:41:01	2022-06-28 13:41:01	3M	False	1
12	250	3.255	153.5	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372539195	2022-06-28 14:36:15	2022-06-28 14:36:15	3M	False	1
13	250	3.255	153.5	1	2	( )	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372533865	2022-06-28 14:37:11	2022-06-28 14:37:11	3M	False	1
14	380	2.473	43	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372657384	2022-06-28 16:40:37	2022-06-28 16:40:37	3M	False	1
15	380	2.473	43	1	2	(-)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372657386	2022-06-28 16:40:37	2022-06-28 16:40:37	3M	False	1
16	130	3.223	152	1	2	(=)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372656023	2022-06-28 16:41:00	2022-06-28 16:41:00	3M	False	1
17	130	3.223	152	1	2	(=)	2023-06-30 00:00:00	2024-06-30 00:00:00	USD	372656033	2022-06-28 16:41:00	2022-06-28 16:41:00	3M	False	1

CodePudding user response：

IIUC, you can use a custom group and a merge_asof on ( ) with back insertion of the lone (-) values:

cols = ['A', 'B', 'C', 'D', 'E']

df['ExecutionTimestamp'] = pd.to_datetime(df['ExecutionTimestamp'])

# identify   rows
m = df['F'].eq('( )')

# merge
out = (pd
  .merge_asof(df[m].reset_index(), df[~m].reset_index(),
              by=cols, on='ExecutionTimestamp',
              direction='nearest', tolerance=pd.Timedelta('2.5min'),
              suffixes=(None, '_')
             )
  .assign(F=lambda d: np.where(d['F_'].isna(), d['F'], '(=)'),
          C=lambda d: np.where(d['F_'].isna(), d['C'], d['C']*2), 
         )
 )

# add missing (lone) (-) value
missing = df.index.difference(out[['index', 'index_']].stack())

out = (pd
       .concat([out.set_index('index'), df.loc[missing]])
       .drop(columns=['index_', 'F_'])
      )

output:

      A      B    C         D        E    F  ExecutionTimestamp
15  130  3.123   77  0.975342  1.97808  ( ) 2022-06-28 13:41:00
16  130  3.123  154  0.975342  1.97808  (=) 2022-06-28 16:41:00
17  130  3.223  152  0.975342  1.97808  (=) 2022-06-28 16:41:00

CodePudding user response：

This should allow you to get your expected results.

df = df.groupby(['A', 'B', 'D', 'E']).agg({'C' : 'sum', 'ExecutionTimestamp' : 'last'}).reset_index()
df['F'] = '(=)'
df[['A', 'B', 'C', 'D', 'E', 'F', 'ExecutionTimestamp']]

I was sure what you meant by the "keep either timestamp" if you meant that you could manually choose which one of the timestamps you wanted or if you were willing to accept either. I just assumed you would want the 'last' one in my code, but you can change that to 'first' if you would prefer the first encounter value.