This is a bit complicated but I will try to explain as best as I can.
i have the following dataframe.
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e 20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e 20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e 20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e 20 transfer stETH
5 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 6.391691160717606e 19 transfer stETH
6 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 4.260674654274649e 19 transfer stETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e 20 trace ETH
8 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e 20 trace ETH
9 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e 20 transfer stETH
10 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0 4.257e 19 transfer stETH
11 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0xdc24316b9ae028f1497c275eb9192a3ea0f67022 9.933e 19 transfer stETH
These 11 transfers represent two swap transactions: 2 unique hash) between ETH and stETH. It'd be very nice if there was two clean transactions with one ETH going from A to B, and the other one stETH going from B to A. But, in decentralized exchanges, things work via routers which send multiple transactions through various addresses to complete one swap transaction.
Here, there are two transactions (as you can see, within one transaction (hash) made up of several transfers). I want to verify which ETH amounts corresponds to stETH. ETH and stETH prices are almost 1:1 so they should be quite close in value.
So from the first transaction (1-6), there is one ETH value (1.052e 20) but three different stETH value (1.0652365814992255e 20, 6.391691160717606e 19, and 4.260674654274649e 19). Obviously, it is clear that the pair that corresponds to the 1.052e 20 ETH swap is 1.0652365814992255e 20 stETH as it is the closest in value.
So, in order to filter out the right pair, I want to groupby transaction_hash
, if there are more than 1 unique value of stETH, then I want to pick out the one that is closest to the ETH value.
so, the desired output will be:
transaction_hash block_timestamp from_address to_address value data token_address
1 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0xe5d84152dd961e2eb0d6c202cf3396f579974983 0x1111111254eeb25477b68fb85ed929f73a960582 1.052e 20 trace ETH
2 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.052e 20 trace ETH
3 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xe5d84152dd961e2eb0d6c202cf3396f579974983 1.0652365814992255e 20 transfer stETH
4 0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952 2023-01-02 03:12:59 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.0652365814992255e 20 transfer stETH
5 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x1111111254eeb25477b68fb85ed929f73a960582 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 1.401493579633375e 20 trace ETH
6 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 0x1111111254eeb25477b68fb85ed929f73a960582 1.401493579633375e 20 trace ETH
7 0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939 2023-01-04 07:34:47 00:00 0xcb62961daac29b79ebac9a30e142da0e8ba8ead6 0x53222470cdcfb8081c0e3a50fd106f0d69e63f20 1.419e 20 transfer stETH
Thanks!
EDIT!
I applied the code suggested below. Something strange is happening. My original df has a transaction like this
transaction_hash block_timestamp from_address to_address value data token_address
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e 17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e 17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e 17 transfer stETH
when I apply the code, instead of getting what I want which is..
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e 17 trace ETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e 17 transfer stETH
625132 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x4028daac072e492d34a3afdbef0ba7e35d8b55c4 8.800918863842962e 17 transfer stETH
I get more rows! and doesn't even remove the rows
transaction_hash block_timestamp from_address to_address value data token_address
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e 17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e 17 transfer stETH
30155 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 00:00 0x6d02e95909da8da09865a26b62055bd6a1d5f706 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 8.800918863842962e 17 transfer stETH
60347 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x558247e365be655f9144e1a0140d793984372ef3 6917030000000000.0 trace ETH
61076 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0xb1720612d0131839dc489fcf20398ea925282fca 1220650000000000.0 trace ETH
399307 0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947 2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57 0x6d02e95909da8da09865a26b62055bd6a1d5f706 8.4846e 17 trace ETH
what is going on?
CodePudding user response:
If you can sort your data by value
you can use .merge_asof()
to find the "nearest neighbor".
The by
argument allows you specify grouping.
df = df.sort_values("value")
left = (
df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_x"})
.reset_index()
)
right = (
df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
.rename(columns={"value": "value_y"})
.reset_index()
)
nearest = pd.merge_asof(
left = left,
right = right,
by = "transaction_hash",
left_on = "value_x",
right_on = "value_y",
direction = "nearest"
)
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()
minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]
print(df.merge(pairs))