Home > OS >  groupby and apply multiple conditions
groupby and apply multiple conditions

Time:02-05

This is a bit complicated but I will try to explain as best as I can.

i have the following dataframe.

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
1   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0xe5d84152dd961e2eb0d6c202cf3396f579974983  0x1111111254eeb25477b68fb85ed929f73a960582  1.052e 20   trace   ETH
2   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.052e 20   trace   ETH
3   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xe5d84152dd961e2eb0d6c202cf3396f579974983  1.0652365814992255e 20  transfer    stETH
4   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.0652365814992255e 20  transfer    stETH
5   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  6.391691160717606e 19   transfer    stETH
6   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0xdc24316b9ae028f1497c275eb9192a3ea0f67022  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  4.260674654274649e 19   transfer    stETH
7   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  1.401493579633375e 20   trace   ETH
8   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.401493579633375e 20   trace   ETH
9   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.419e 20   transfer    stETH
10  0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x7f39c581f595b53c5cb19bd0b3f8da6c935e2ca0  4.257e 19   transfer    stETH
11  0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0xdc24316b9ae028f1497c275eb9192a3ea0f67022  9.933e 19   transfer    stETH

These 11 transfers represent two swap transactions: 2 unique hash) between ETH and stETH. It'd be very nice if there was two clean transactions with one ETH going from A to B, and the other one stETH going from B to A. But, in decentralized exchanges, things work via routers which send multiple transactions through various addresses to complete one swap transaction.

Here, there are two transactions (as you can see, within one transaction (hash) made up of several transfers). I want to verify which ETH amounts corresponds to stETH. ETH and stETH prices are almost 1:1 so they should be quite close in value.

So from the first transaction (1-6), there is one ETH value (1.052e 20) but three different stETH value (1.0652365814992255e 20, 6.391691160717606e 19, and 4.260674654274649e 19). Obviously, it is clear that the pair that corresponds to the 1.052e 20 ETH swap is 1.0652365814992255e 20 stETH as it is the closest in value.

So, in order to filter out the right pair, I want to groupby transaction_hash, if there are more than 1 unique value of stETH, then I want to pick out the one that is closest to the ETH value.

so, the desired output will be:

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
1   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0xe5d84152dd961e2eb0d6c202cf3396f579974983  0x1111111254eeb25477b68fb85ed929f73a960582  1.052e 20   trace   ETH
2   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.052e 20   trace   ETH
3   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xe5d84152dd961e2eb0d6c202cf3396f579974983  1.0652365814992255e 20  transfer    stETH
4   0x00685b3aecf64de61bca7a7c7068c17879bb2a2f3ebfe65d4b9421b40ac63952  2023-01-02 03:12:59 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.0652365814992255e 20  transfer    stETH
5   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x1111111254eeb25477b68fb85ed929f73a960582  0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  1.401493579633375e 20   trace   ETH
6   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  0x1111111254eeb25477b68fb85ed929f73a960582  1.401493579633375e 20   trace   ETH
7   0x00a0ff958f99fabe8a6bde12304436ed6c43524d1ab12bced426abf3a507d939  2023-01-04 07:34:47 00:00   0xcb62961daac29b79ebac9a30e142da0e8ba8ead6  0x53222470cdcfb8081c0e3a50fd106f0d69e63f20  1.419e 20   transfer    stETH


Thanks!

EDIT!

I applied the code suggested below. Something strange is happening. My original df has a transaction like this

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
60347   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x558247e365be655f9144e1a0140d793984372ef3  6917030000000000.0  trace   ETH
61076   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0xb1720612d0131839dc489fcf20398ea925282fca  1220650000000000.0  trace   ETH
399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e 17  trace   ETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e 17   transfer    stETH
625132  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x4028daac072e492d34a3afdbef0ba7e35d8b55c4  8.800918863842962e 17   transfer    stETH

when I apply the code, instead of getting what I want which is..

399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e 17  trace   ETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e 17   transfer    stETH
625132  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x4028daac072e492d34a3afdbef0ba7e35d8b55c4  8.800918863842962e 17   transfer    stETH

I get more rows! and doesn't even remove the rows

    transaction_hash    block_timestamp from_address    to_address  value   data    token_address
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e 17   transfer    stETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e 17   transfer    stETH
30155   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 00:00   0x6d02e95909da8da09865a26b62055bd6a1d5f706  0xdef171fe48cf0115b1d80b88dc8eab59176fee57  8.800918863842962e 17   transfer    stETH
60347   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x558247e365be655f9144e1a0140d793984372ef3  6917030000000000.0  trace   ETH
61076   0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0xb1720612d0131839dc489fcf20398ea925282fca  1220650000000000.0  trace   ETH
399307  0x001d443681cebc7d9520b19bbd7b4d2ac090c366cb6a2541f46573035a1d5947  2022-05-26 09:57:49 UTC 0xdef171fe48cf0115b1d80b88dc8eab59176fee57  0x6d02e95909da8da09865a26b62055bd6a1d5f706  8.4846e 17  trace   ETH

what is going on?

CodePudding user response:

If you can sort your data by value you can use .merge_asof() to find the "nearest neighbor".

The by argument allows you specify grouping.

df = df.sort_values("value")

left = (
   df.loc[df["token_address"] == "ETH", ["transaction_hash", "value"]]
     .rename(columns={"value": "value_x"})
     .reset_index()
)

right = (
   df.loc[df["token_address"] != "ETH", ["transaction_hash", "value"]]
     .rename(columns={"value": "value_y"})
     .reset_index()
)

nearest = pd.merge_asof(
   left      = left,
   right     = right,
   by        = "transaction_hash", 
   left_on   = "value_x",
   right_on  = "value_y",
   direction = "nearest"
)
 
nearest["abs"] = (nearest["value_x"] - nearest["value_y"]).abs()
idxmin = nearest.groupby("value_y")["abs"].idxmin()

minrows = nearest.loc[idxmin, ["index_x", "index_y"]].to_numpy().ravel()
pairs = df.loc[minrows, ["transaction_hash", "value", "data", "token_address"]]

print(df.merge(pairs))
  • Related