Pandas - aggregate group by dataframe-CodePudding

I have a dataframe with Ethereum transactions and relative internal execution calls. The data looks like the following:

  {
    "BLOCK": 12003477,
    "TIMESTAMP": "2021-03-09 09:21:58.000",
    "TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
    "CALL_ID": "\\N",
    "CALL_TYPE": "call",
    "FROM_ADDRESS": "0x45ae5ae57a0f4e6250f782fc6f0daec5dead27e4",
    "TO_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "FUNCTION_NAME": "createEstateWithMetadata",
    "ORDER_INDEX": 3859
  },
  {
    "BLOCK": 12003477,
    "TIMESTAMP": "2021-03-09 09:21:58.000",
    "TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
    "CALL_ID": "0",
    "CALL_TYPE": "delegatecall",
    "FROM_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "TO_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
    "FUNCTION_NAME": "createEstateWithMetadata",
    "ORDER_INDEX": 3861
  },
  {
    "BLOCK": 12003477,
    "TIMESTAMP": "2021-03-09 09:21:58.000",
    "TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
    "CALL_ID": "0_0",
    "CALL_TYPE": "call",
    "FROM_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
    "TO_ADDRESS": "0x959e104e1a4db6317fa58f8295f586e1a978c297",
    "FUNCTION_NAME": "mint",
    "ORDER_INDEX": 3863
  },
  {
    "BLOCK": 12003477,
    "TIMESTAMP": "2021-03-09 09:21:58.000",
    "TX_HASH": "0x78fb79172b623efeebcc07d5dbf734e40890e56e8d3783ffc82b24618682f594",
    "CALL_ID": "0_0_0",
    "CALL_TYPE": "delegatecall",
    "FROM_ADDRESS": "0x959e104e1a4db6317fa58f8295f586e1a978c297",
    "TO_ADDRESS": "0x1784ef41af86e97f8d28afe95b573a24aeda966e",
    "FUNCTION_NAME": "mint",
    "ORDER_INDEX": 3866
  }

Records with the same TX_HASH value belong to the same execution track. Among them, the one with CALL_ID = \\N represents the transaction (etherscan), and the others are the internal calls. On the internal calls records the CALL_ID field can get values like 0, 0_0, 0_1, 0_0_0, 0_1_0, and so on... Each number after the underscore indicates a deeper level in the execution trace. The record with \\N is the actual transaction and the others represent the function executed internally by a smart contract during the execution.

Inside the dataframe, there are thousands of different transactions, and, so, thousand of different TX_HASH values. For each different hash value, there can be 1 or more records representing the internal calls.

Moreover, as you can see from the data sample, there is also a FUNCTION_NAME which indicates the function executed by the smart contract. In the case of \\N it indicates the top-level function, and on 0, 0_0 indicates the internal functions.

What I need to do is to create different dataframes for each top-level FUNCTION_NAME, which means each different FUNCTION_NAME value on records with CALL_ID = \\N. Considering createEstateWithMetadata above, there should be a dataframe that contains the records that has FUNCTION_NAME = createEstateWithMetadata and CALL_ID = \\N, then considering the hashes of such records, we need to identify the internal calls (they have the same hash) and add them into the dataframe.

For instance, the dataframe for createEstateWithMetadata should contain the records contained in the data sample above. In the initial dataframe, there are more than 20 different values for FUNCTION_NAME when CALL_ID = \\N. An example is setApprovalForAll below which will lead to a new dataframe:

  {
    "BLOCK": 12019039,
    "TIMESTAMP": "2021-03-11 19:03:09.000",
    "TX_HASH": "0xdabadcba08961d6fdffbdee0a5f801dd59c098b65c41087d95a608065ade712b",
    "CALL_ID": "\\N",
    "CALL_TYPE": "call",
    "FROM_ADDRESS": "0x3b38a52998a4b786638e774007f3f8ab34db792e",
    "TO_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "FUNCTION_NAME": "setApprovalForAll",
    "ORDER_INDEX": 1620
  },
  {
    "BLOCK": 12019039,
    "TIMESTAMP": "2021-03-11 19:03:09.000",
    "TX_HASH": "0xdabadcba08961d6fdffbdee0a5f801dd59c098b65c41087d95a608065ade712b",
    "CALL_ID": "0",
    "CALL_TYPE": "delegatecall",
    "FROM_ADDRESS": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "TO_ADDRESS": "0xa57e126b341b18c262ad25b86bb4f65b5e2ade45",
    "FUNCTION_NAME": "approve",
    "ORDER_INDEX": 1622
  }

My idea was to first group by TX_HASH to get the "traces" of execution and then get the FUNCTION_NAME value for the first record of each group (which should be the one with CALL_ID = \\N) to create the dataframes and iteratively put each group in the correct dataframe. I believe that this could be the solution.

CodePudding user response：

This seems to do the trick: create a new column where the CALL_ID is equal to "\\N", then group by TX_HASH and map the TX_HASH to the first value of the newly created column. (As there can only be one CALL_ID=="\\N" per TX_HASH.

df["NEW_HASH_GROUP"] = (df.CALL_ID == "\\N") * df.FUNCTION_NAME
df["GROUP"] = df.TX_HASH.map(df.groupby("TX_HASH").NEW_HASH_GROUP.first())

# Split each group in its own DataFrame
dfs = [f for _, f in df.groupby(["GROUP"])]

If I do:

for df in dfs:
    print(df.GROUP.unique())

# ['createEstateWithMetadata']
# ['setApprovalForAll']