Home > Back-end >  Pandas - Grouping related data entries, with imperfect data
Pandas - Grouping related data entries, with imperfect data

Time:07-06

This likely isn't an easy question. I am looking at public trade data, and trying to group 'related' items. But the data itself isn't perfect (the human element) or is misleading. The good news, is that related items generally follow similar patterns.

Here's an example

Notional Premium Strike Structure StartDate EndDate Index ExecutionTimestamp ResetFreq
10 500,000,000 9,125,000 0.02925 Call 2024-06-30 2025-06-30 SOFR-OIS 2022-06-28T13:41:01 3
11 500,000,000 9,125,000 0.02925 Call 2024-06-30 2025-06-30 SOFR-OIS 2022-06-28T13:41:01 3
12 250,000,000 3,837,500 0.03255 Call 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T14:36:15 3
13 250,000,000 3,837,500 0.03255 Call 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T14:37:11 3
14 380,000,000 1,633,999.99462 0.02473 Put 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:40:37 3
15 380,000,000 1,633,999.99462 0.02473 Put 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:40:37 3
16 130,000,000 987,999.99952 0.03223 Call 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:41:00 3
17 130,000,000 987,999.99952 0.03223 Call 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:41:00 3
18 130,000,000 987,999.99952 0.03223 Put 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:41:16 3
19 130,000,000 987,999.99952 0.03223 Put 2023-06-30 2024-06-30 SOFR-OIS 2022-06-28T16:41:16 3

A lot of real similar looking things here! One specific note on the columns:

Structure is misleading. In reality, there are 3 possible values [Call, Put, Straddle]. However, it seems like Put = Put, but Call = Call OR Straddle in the data. A straddle = 1 Put 1 Call with the same notional, dates and strike.

So in the example I can see that index 16, 17, 18, 19 are very likely to be 2 straddle trades [and I know this to be true]. However, index 14 & 15 are also related to these two straddle trades! The way I would know this is:

  1. Timestamp is very close in time (<5min-ish)
  2. The difference in (strike*10000) between 16/17/18/19 and 14/15 is divisible by a factor of 25 [people like round numbers].
  3. StartDate, EndDate, Index, Reset Freq match as well.

In this case (14,15,16,17,18,19) I would want to return 2x “1 straddle vs 1 put trade with details xyz”. In pandas terms, I would convert the put call into a straddle where applicable, and be left with two “Group” IDs to identify related trade.

So in general, I guess I'm asking if there is a way to create a list of rules that determine the probability of it being a related trade? I'm open to any suggestions though!

I can edit the question for more specific rules etc if needed.

Tx!

CodePudding user response:

so this actually might be more of a statistical issue than anything a programming issue. It sounds like you want to measure the distance of similarity between items, and if they are very similar, group them together. Distance metrics like the Gowers Distance could work for you, as they allow to "measure" the distance between continious variables like time, and categorical variables like structure. For more information you can look here. https://medium.com/analytics-vidhya/gowers-distance-899f9c4bd553. If you measure the gowers metric between all consecutive items and design a relevant threshold, then you should be golden. For finding the threshold and designing good groups, methods like Otsu Thresholding can be interesting for you. https://scikit-image.org/docs/dev/auto_examples/segmentation/plot_thresholding.html

If you want to stick to rule base, you can also create a binary indicator for all the categorical variables that represent whether they are equal to the previous row, and measure the actual distance like time delay for the continuous variables. From there you can design arbitrary thresholds for the continuous data (and measure difference) and aggregate the similarity of the binary indicators (simple sum would do, after all True = 1 and False = 0) Whether the output is legit, is hard to qualify.

There are libraries that can do the heavy lifting for you, usually based on fuzzy matching. An example is dedupe https://github.com/dedupeio/dedupe. Whether this would fit your need I dont know.

  • Related