I have a dataframe with Ethereum transactions and relative internal execution calls. The data looks like the following:
{
"BLOCK": 12003477,
"TIMESTAMP": "2021-03-09 09:21:58.000",
"TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
"CALL_ID": "\\N",
"CALL_TYPE": "call",
"FROM_ADDRESS": "0x45ae5ae57a0f4e6250f782fc6f0daec5dead27e4",
"TO_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"FUNCTION_NAME": "createEstateWithMetadata",
"ORDER_INDEX": 3859
},
{
"BLOCK": 12003477,
"TIMESTAMP": "2021-03-09 09:21:58.000",
"TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
"CALL_ID": "0",
"CALL_TYPE": "delegatecall",
"FROM_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"TO_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
"FUNCTION_NAME": "createEstateWithMetadata",
"ORDER_INDEX": 3861
},
{
"BLOCK": 12003477,
"TIMESTAMP": "2021-03-09 09:21:58.000",
"TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
"CALL_ID": "0_0",
"CALL_TYPE": "call",
"FROM_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
"TO_ADDRESS": "0x959e104e1a4db6317fa58f8295f586e1a978c297",
"FUNCTION_NAME": "mint",
"ORDER_INDEX": 3863
},
{
"BLOCK": 12003477,
"TIMESTAMP": "2021-03-09 09:21:58.000",
"TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
"CALL_ID": "0_0_0",
"CALL_TYPE": "delegatecall",
"FROM_ADDRESS": "0x959e104e1a4db6317fa58f8295f586e1a978c297",
"TO_ADDRESS": "0x1784ef41af86e97f8d28afe95b573a24aeda966e",
"FUNCTION_NAME": "mint",
"ORDER_INDEX": 3866
}
Records with the same TX_HASH
value belong to the same execution track. Among them, the one with CALL_ID = \\N
represents the transaction (etherscan), and the others are the internal calls. On the internal calls records the CALL_ID
field can get values like 0
, 0_0
, 0_1
, 0_0_0
, 0_1_0
, and so on... Each number after the underscore indicates a deeper level in the execution trace. The record with \\N
is the actual transaction and the others represent the function executed internally by a smart contract during the execution.
Inside the dataframe, there are thousands of different transactions, and, so, thousand of different TX_HASH
values. For each different hash value, there can be 1 or more records representing the internal calls.
Moreover, as you can see from the data sample, there is also a FUNCTION_NAME
which indicates the function executed by the smart contract. In the case of \\N
it indicates the top-level function, and on 0
, 0_0
indicates the internal functions.
What I need to do is to create different dataframes for each top-level FUNCTION_NAME
, which means each different FUNCTION_NAME
value on records with CALL_ID = \\N
. Considering createEstateWithMetadata
above, there should be a dataframe that contains the records that has FUNCTION_NAME = createEstateWithMetadata
and CALL_ID = \\N
, then considering the hashes of such records, we need to identify the internal calls (they have the same hash) and add them into the dataframe.
For instance, the dataframe for createEstateWithMetadata
should contain the records contained in the data sample above. In the initial dataframe, there are more than 20 different values for FUNCTION_NAME
when CALL_ID = \\N
. An example is setApprovalForAll
below which will lead to a new dataframe:
{
"BLOCK": 12019039,
"TIMESTAMP": "2021-03-11 19:03:09.000",
"TX_HASH": "0xdabadcba08961d6fdffbdee0a5f801dd59c098b65c41087d95a608065ade712b",
"CALL_ID": "\\N",
"CALL_TYPE": "call",
"FROM_ADDRESS": "0x3b38a52998a4b786638e774007f3f8ab34db792e",
"TO_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"FUNCTION_NAME": "setApprovalForAll",
"ORDER_INDEX": 1620
},
{
"BLOCK": 12019039,
"TIMESTAMP": "2021-03-11 19:03:09.000",
"TX_HASH": "0xdabadcba08961d6fdffbdee0a5f801dd59c098b65c41087d95a608065ade712b",
"CALL_ID": "0",
"CALL_TYPE": "delegatecall",
"FROM_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"TO_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
"FUNCTION_NAME": "approve",
"ORDER_INDEX": 1622
}
My idea was to first group by TX_HASH
to get the "traces" of execution and then get the FUNCTION_NAME
value for the first record of each group (which should be the one with CALL_ID = \\N
) to create the dataframes and iteratively put each group in the correct dataframe. I believe that this could be the solution.
CodePudding user response:
This seems to do the trick: create a new column where the CALL_ID
is equal to "\\N"
, then group by TX_HASH
and map the TX_HASH
to the first value of the newly created column. (As there can only be one CALL_ID=="\\N"
per TX_HASH
.
df["NEW_HASH_GROUP"] = (df.CALL_ID == "\\N") * df.FUNCTION_NAME
df["GROUP"] = df.TX_HASH.map(df.groupby("TX_HASH").NEW_HASH_GROUP.first())
# Split each group in its own DataFrame
dfs = [f for _, f in df.groupby(["GROUP"])]
If I do:
for df in dfs:
print(df.GROUP.unique())
# ['createEstateWithMetadata']
# ['setApprovalForAll']