I have a Dataframe that sometimes contains 2 rows for what is, in reality, one entry. The way to identify these is:
- Columns: Not, Strike, Cents, SD, ED are identical
- Column ExecutionTimestamp is going to be within a short period of time (<2.5min)
- For a ( ) in Structure, there exists an offsetting (-)
Not | Strike | Cents | SD | ED | Structure | StartDate | EndDate | Index | DisseminationID | ExecutionTimestamp | EventTimestamp | FloatingRateResetFrequencyPeriod1 | TruncatedNotional | TDateID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 500 | 2.925 | 182.5 | 2 | 3 | ( ) | 2024-06-30 00:00:00 | 2025-06-30 00:00:00 | USD | 372521736 | 2022-06-28 13:41:01 | 2022-06-28 13:41:01 | 3M | False | 1 |
11 | 500 | 2.925 | 182.5 | 2 | 3 | ( ) | 2024-06-30 00:00:00 | 2025-06-30 00:00:00 | USD | 372530892 | 2022-06-28 13:41:01 | 2022-06-28 13:41:01 | 3M | False | 1 |
12 | 250 | 3.255 | 153.5 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372539195 | 2022-06-28 14:36:15 | 2022-06-28 14:36:15 | 3M | False | 1 |
13 | 250 | 3.255 | 153.5 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372533865 | 2022-06-28 14:37:11 | 2022-06-28 14:37:11 | 3M | False | 1 |
14 | 380 | 2.473 | 43 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372657384 | 2022-06-28 16:40:37 | 2022-06-28 16:40:37 | 3M | False | 1 |
15 | 380 | 2.473 | 43 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372657386 | 2022-06-28 16:40:37 | 2022-06-28 16:40:37 | 3M | False | 1 |
16 | 130 | 3.223 | 76 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372656023 | 2022-06-28 16:41:00 | 2022-06-28 16:41:00 | 3M | False | 1 |
17 | 130 | 3.223 | 76 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372656033 | 2022-06-28 16:41:00 | 2022-06-28 16:41:00 | 3M | False | 1 |
18 | 130 | 3.223 | 76 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372654200 | 2022-06-28 16:41:16 | 2022-06-28 16:41:16 | 3M | False | 1 |
19 | 130 | 3.223 | 76 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372654203 | 2022-06-28 16:41:16 | 2022-06-28 16:41:16 | 3M | False | 1 |
I am looking to combine these pairs of 2 whereby it will result in:
- Columns Not, Strike, SD, ED stay the same.
- Column Cents is added together
- Column Structure is replaced with (=)
- I dont care which timestamp is returned, whatever easy.
Not | Strike | Cents | SD | ED | Structure | StartDate | EndDate | Index | DisseminationID | ExecutionTimestamp | EventTimestamp | FloatingRateResetFrequencyPeriod1 | TruncatedNotional | TDateID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 500 | 2.925 | 182.5 | 2 | 3 | ( ) | 2024-06-30 00:00:00 | 2025-06-30 00:00:00 | USD | 372521736 | 2022-06-28 13:41:01 | 2022-06-28 13:41:01 | 3M | False | 1 |
11 | 500 | 2.925 | 182.5 | 2 | 3 | ( ) | 2024-06-30 00:00:00 | 2025-06-30 00:00:00 | USD | 372530892 | 2022-06-28 13:41:01 | 2022-06-28 13:41:01 | 3M | False | 1 |
12 | 250 | 3.255 | 153.5 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372539195 | 2022-06-28 14:36:15 | 2022-06-28 14:36:15 | 3M | False | 1 |
13 | 250 | 3.255 | 153.5 | 1 | 2 | ( ) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372533865 | 2022-06-28 14:37:11 | 2022-06-28 14:37:11 | 3M | False | 1 |
14 | 380 | 2.473 | 43 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372657384 | 2022-06-28 16:40:37 | 2022-06-28 16:40:37 | 3M | False | 1 |
15 | 380 | 2.473 | 43 | 1 | 2 | (-) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372657386 | 2022-06-28 16:40:37 | 2022-06-28 16:40:37 | 3M | False | 1 |
16 | 130 | 3.223 | 152 | 1 | 2 | (=) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372656023 | 2022-06-28 16:41:00 | 2022-06-28 16:41:00 | 3M | False | 1 |
17 | 130 | 3.223 | 152 | 1 | 2 | (=) | 2023-06-30 00:00:00 | 2024-06-30 00:00:00 | USD | 372656033 | 2022-06-28 16:41:00 | 2022-06-28 16:41:00 | 3M | False | 1 |
CodePudding user response:
IIUC, you can use a custom group and a merge_asof
on ( ) with back insertion of the lone (-) values:
cols = ['A', 'B', 'C', 'D', 'E']
df['ExecutionTimestamp'] = pd.to_datetime(df['ExecutionTimestamp'])
# identify rows
m = df['F'].eq('( )')
# merge
out = (pd
.merge_asof(df[m].reset_index(), df[~m].reset_index(),
by=cols, on='ExecutionTimestamp',
direction='nearest', tolerance=pd.Timedelta('2.5min'),
suffixes=(None, '_')
)
.assign(F=lambda d: np.where(d['F_'].isna(), d['F'], '(=)'),
C=lambda d: np.where(d['F_'].isna(), d['C'], d['C']*2),
)
)
# add missing (lone) (-) value
missing = df.index.difference(out[['index', 'index_']].stack())
out = (pd
.concat([out.set_index('index'), df.loc[missing]])
.drop(columns=['index_', 'F_'])
)
output:
A B C D E F ExecutionTimestamp
15 130 3.123 77 0.975342 1.97808 ( ) 2022-06-28 13:41:00
16 130 3.123 154 0.975342 1.97808 (=) 2022-06-28 16:41:00
17 130 3.223 152 0.975342 1.97808 (=) 2022-06-28 16:41:00
CodePudding user response:
This should allow you to get your expected results.
df = df.groupby(['A', 'B', 'D', 'E']).agg({'C' : 'sum', 'ExecutionTimestamp' : 'last'}).reset_index()
df['F'] = '(=)'
df[['A', 'B', 'C', 'D', 'E', 'F', 'ExecutionTimestamp']]
I was sure what you meant by the "keep either timestamp" if you meant that you could manually choose which one of the timestamps you wanted or if you were willing to accept either. I just assumed you would want the 'last' one in my code, but you can change that to 'first' if you would prefer the first encounter value.