Pandas - Group / Aggregate rows based on duplication AND the existence of an opposite

I have a DataFrame that sometimes contains two rows for what is, in reality, a single entry. These rows can be identified as follows:

  1. Columns Not, Strike, Cents, SD, ED are identical
  2. The ExecutionTimestamp values are within a short period of each other (<2.5 min)
  3. For a ( ) row in Structure, there exists an offsetting (-) row
Not Strike Cents SD ED Structure StartDate EndDate Index DisseminationID ExecutionTimestamp EventTimestamp FloatingRateResetFrequencyPeriod1 TruncatedNotional TDateID
10 500 2.925 182.5 2 3 ( ) 2024-06-30 00:00:00 2025-06-30 00:00:00 USD 372521736 2022-06-28 13:41:01 2022-06-28 13:41:01 3M False 1
11 500 2.925 182.5 2 3 ( ) 2024-06-30 00:00:00 2025-06-30 00:00:00 USD 372530892 2022-06-28 13:41:01 2022-06-28 13:41:01 3M False 1
12 250 3.255 153.5 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372539195 2022-06-28 14:36:15 2022-06-28 14:36:15 3M False 1
13 250 3.255 153.5 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372533865 2022-06-28 14:37:11 2022-06-28 14:37:11 3M False 1
14 380 2.473 43 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372657384 2022-06-28 16:40:37 2022-06-28 16:40:37 3M False 1
15 380 2.473 43 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372657386 2022-06-28 16:40:37 2022-06-28 16:40:37 3M False 1
16 130 3.223 76 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372656023 2022-06-28 16:41:00 2022-06-28 16:41:00 3M False 1
17 130 3.223 76 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372656033 2022-06-28 16:41:00 2022-06-28 16:41:00 3M False 1
18 130 3.223 76 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372654200 2022-06-28 16:41:16 2022-06-28 16:41:16 3M False 1
19 130 3.223 76 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372654203 2022-06-28 16:41:16 2022-06-28 16:41:16 3M False 1
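As a minimal sketch of the pairing criteria above (toy data using the question's column names; the grouping/loop here is just an illustration, not the accepted solution):

```python
import pandas as pd

# Toy frame with only the key columns (values are illustrative)
df = pd.DataFrame({
    "Not": [130, 130, 130, 130],
    "Strike": [3.223, 3.223, 3.223, 3.223],
    "Cents": [76, 76, 76, 76],
    "SD": [1, 1, 1, 1],
    "ED": [2, 2, 2, 2],
    "Structure": ["( )", "( )", "(-)", "(-)"],
    "ExecutionTimestamp": pd.to_datetime([
        "2022-06-28 16:41:00", "2022-06-28 16:41:00",
        "2022-06-28 16:41:16", "2022-06-28 16:41:16",
    ]),
})

keys = ["Not", "Strike", "Cents", "SD", "ED"]

# Within each group of identical key columns, a "( )" row pairs with a
# "(-)" row whose timestamp is less than 2.5 minutes away
pairs = []
for key, g in df.groupby(keys):
    plus = g[g["Structure"].eq("( )")]
    minus = g[g["Structure"].eq("(-)")]
    if not plus.empty and not minus.empty:
        gap = abs(minus["ExecutionTimestamp"].iloc[0]
                  - plus["ExecutionTimestamp"].iloc[0])
        if gap <= pd.Timedelta("2.5min"):
            pairs.append(key)
```

Here all four rows share the same key columns and the timestamps are 16 seconds apart, so exactly one pairable group is found.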

I am looking to combine each such pair into a single row so that:

  1. Columns Not, Strike, SD, ED stay the same.
  2. Column Cents is the sum of both rows.
  3. Column Structure is replaced with (=).
  4. I don't care which timestamp is returned; whichever is easiest.
Not Strike Cents SD ED Structure StartDate EndDate Index DisseminationID ExecutionTimestamp EventTimestamp FloatingRateResetFrequencyPeriod1 TruncatedNotional TDateID
10 500 2.925 182.5 2 3 ( ) 2024-06-30 00:00:00 2025-06-30 00:00:00 USD 372521736 2022-06-28 13:41:01 2022-06-28 13:41:01 3M False 1
11 500 2.925 182.5 2 3 ( ) 2024-06-30 00:00:00 2025-06-30 00:00:00 USD 372530892 2022-06-28 13:41:01 2022-06-28 13:41:01 3M False 1
12 250 3.255 153.5 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372539195 2022-06-28 14:36:15 2022-06-28 14:36:15 3M False 1
13 250 3.255 153.5 1 2 ( ) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372533865 2022-06-28 14:37:11 2022-06-28 14:37:11 3M False 1
14 380 2.473 43 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372657384 2022-06-28 16:40:37 2022-06-28 16:40:37 3M False 1
15 380 2.473 43 1 2 (-) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372657386 2022-06-28 16:40:37 2022-06-28 16:40:37 3M False 1
16 130 3.223 152 1 2 (=) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372656023 2022-06-28 16:41:00 2022-06-28 16:41:00 3M False 1
17 130 3.223 152 1 2 (=) 2023-06-30 00:00:00 2024-06-30 00:00:00 USD 372656033 2022-06-28 16:41:00 2022-06-28 16:41:00 3M False 1

CodePudding user response:

IIUC, you can use a merge_asof of the ( ) rows against the (-) rows, then insert back any lone (-) rows that found no match:

import numpy as np
import pandas as pd

# columns A..E stand in for Not, Strike, Cents, SD, ED; F is Structure
cols = ['A', 'B', 'C', 'D', 'E']

df['ExecutionTimestamp'] = pd.to_datetime(df['ExecutionTimestamp'])

# identify ( ) rows
m = df['F'].eq('( )')

# merge each ( ) row with the nearest (-) row within 2.5 minutes
# NB: both inputs of merge_asof must be sorted on the "on" key
out = (pd
  .merge_asof(df[m].reset_index(), df[~m].reset_index(),
              by=cols, on='ExecutionTimestamp',
              direction='nearest', tolerance=pd.Timedelta('2.5min'),
              suffixes=(None, '_')
             )
  .assign(F=lambda d: np.where(d['F_'].isna(), d['F'], '(=)'),
          C=lambda d: np.where(d['F_'].isna(), d['C'], d['C']*2),
         )
 )

# add back the lone (unmatched) (-) rows
missing = df.index.difference(out[['index', 'index_']].stack())

out = (pd
       .concat([out.set_index('index'), df.loc[missing]])
       .drop(columns=['index_', 'F_'])
      )

output:

      A      B    C         D        E    F  ExecutionTimestamp
15  130  3.123   77  0.975342  1.97808  ( ) 2022-06-28 13:41:00
16  130  3.123  154  0.975342  1.97808  (=) 2022-06-28 16:41:00
17  130  3.223  152  0.975342  1.97808  (=) 2022-06-28 16:41:00
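For reference, here is a standalone illustration of how pd.merge_asof with direction='nearest' and a tolerance behaves (the key/column names are made up for this demo): rows farther apart than the tolerance are left unmatched and get NaN in the right-hand columns.

```python
import pandas as pd

left = pd.DataFrame({
    "key": ["x", "x"],
    "t": pd.to_datetime(["2022-06-28 16:41:00", "2022-06-28 13:41:01"]),
}).sort_values("t")

right = pd.DataFrame({
    "key": ["x"],
    "t": pd.to_datetime(["2022-06-28 16:41:16"]),
    "val": [1],
}).sort_values("t")

# Both inputs must be sorted on the "on" key.  The 13:41:01 row is
# hours away from any right-hand row, so it stays unmatched (NaN);
# the 16:41:00 row is 16 seconds away, within the 2.5-minute tolerance.
out = pd.merge_asof(left, right, by="key", on="t",
                    direction="nearest",
                    tolerance=pd.Timedelta("2.5min"))
```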

CodePudding user response:

This should allow you to get your expected results.

df = df.groupby(['A', 'B', 'D', 'E']).agg({'C' : 'sum', 'ExecutionTimestamp' : 'last'}).reset_index()
df['F'] = '(=)'
df[['A', 'B', 'C', 'D', 'E', 'F', 'ExecutionTimestamp']]

I wasn't sure what you meant by "keep either timestamp" — whether you wanted to manually choose which timestamp to keep, or would accept either one. I assumed you would want the 'last' one in my code, but you can change that to 'first' if you prefer the first-encountered value.
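A runnable sketch of this groupby approach on toy data (two duplicate rows; the same A..F stand-in column names as above):

```python
import pandas as pd

# Two rows that represent one real entry: identical keys, Cents of 76 each
df = pd.DataFrame({
    "A": [130, 130],
    "B": [3.223, 3.223],
    "C": [76, 76],
    "D": [1, 1],
    "E": [2, 2],
    "ExecutionTimestamp": pd.to_datetime(
        ["2022-06-28 16:41:00", "2022-06-28 16:41:16"]),
})

# Sum Cents within each key group and keep the last timestamp
out = (df.groupby(["A", "B", "D", "E"])
         .agg({"C": "sum", "ExecutionTimestamp": "last"})
         .reset_index())
out["F"] = "(=)"
out = out[["A", "B", "C", "D", "E", "F", "ExecutionTimestamp"]]
```

Note that this collapses every key group unconditionally, so it fits best if the frame has already been filtered down to the pairable rows.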
