Pandas - Grouping related data entries, with imperfect data-CodePudding

This likely isn't an easy question. I am looking at public trade data, and trying to group 'related' items. But the data itself isn't perfect (the human element) or is misleading. The good news, is that related items generally follow similar patterns.

Here's an example

	Notional	Premium	Strike	Structure	StartDate	EndDate	Index	ExecutionTimestamp	ResetFreq
10	500,000,000	9,125,000	0.02925	Call	2024-06-30	2025-06-30	SOFR-OIS	2022-06-28T13:41:01	3
11	500,000,000	9,125,000	0.02925	Call	2024-06-30	2025-06-30	SOFR-OIS	2022-06-28T13:41:01	3
12	250,000,000	3,837,500	0.03255	Call	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T14:36:15	3
13	250,000,000	3,837,500	0.03255	Call	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T14:37:11	3
14	380,000,000	1,633,999.99462	0.02473	Put	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:40:37	3
15	380,000,000	1,633,999.99462	0.02473	Put	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:40:37	3
16	130,000,000	987,999.99952	0.03223	Call	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:41:00	3
17	130,000,000	987,999.99952	0.03223	Call	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:41:00	3
18	130,000,000	987,999.99952	0.03223	Put	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:41:16	3
19	130,000,000	987,999.99952	0.03223	Put	2023-06-30	2024-06-30	SOFR-OIS	2022-06-28T16:41:16	3

A lot of real similar looking things here! One specific note on the columns:

Structure is misleading. In reality, there are 3 possible values [Call, Put, Straddle]. However, it seems like Put = Put, but Call = Call OR Straddle in the data. A straddle = 1 Put 1 Call with the same notional, dates and strike.

So in the example I can see that index 16, 17, 18, 19 are very likely to be 2 straddle trades [and I know this to be true]. However, index 14 & 15 are also related to these two straddle trades! The way I would know this is:

Timestamp is very close in time (<5min-ish)
The difference in (strike*10000) between 16/17/18/19 and 14/15 is divisible by a factor of 25 [people like round numbers].
StartDate, EndDate, Index, Reset Freq match as well.

In this case (14,15,16,17,18,19) I would want to return 2x “1 straddle vs 1 put trade with details xyz”. In pandas terms, I would convert the put call into a straddle where applicable, and be left with two “Group” IDs to identify related trade.

So in general, I guess I'm asking if there is a way to create a list of rules that determine the probability of it being a related trade? I'm open to any suggestions though!

I can edit the question for more specific rules etc if needed.

Tx!

CodePudding user response：

so this actually might be more of a statistical issue than anything a programming issue. It sounds like you want to measure the distance of similarity between items, and if they are very similar, group them together. Distance metrics like the Gowers Distance could work for you, as they allow to "measure" the distance between continious variables like time, and categorical variables like structure. For more information you can look here. https://medium.com/analytics-vidhya/gowers-distance-899f9c4bd553. If you measure the gowers metric between all consecutive items and design a relevant threshold, then you should be golden. For finding the threshold and designing good groups, methods like Otsu Thresholding can be interesting for you. https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_thresholding.html

If you want to stick to rule base, you can also create a binary indicator for all the categorical variables that represent whether they are equal to the previous row, and measure the actual distance like time delay for the continuous variables. From there you can design arbitrary thresholds for the continuous data (and measure difference) and aggregate the similarity of the binary indicators (simple sum would do, after all True = 1 and False = 0) Whether the output is legit, is hard to qualify.

There are libraries that can do the heavy lifting for you, usually based on fuzzy matching. An example is dedupe https://github.com/dedupeio/dedupe. Whether this would fit your need I dont know.