Pandas: group records by a column value and timestamp and apply a function on each record

I have a JSON file of Ethereum transactions with the following structure:

  ...,
  {
    "blockNumber": "14492022",
    "timeStamp": "1648703953",
    "hash": "0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e",
    "nonce": "89",
    "blockHash": "0xa804b6c72753657275c58b00bf17cd2ca7e2be3cbf4a6615f4cc7175c3c76aea",
    "transactionIndex": "162",
    "from": "0x4ca43dc185ff11844e448604cd11409a92a3794b",
    "to": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
    "value": "0",
    "gas": "130000",
    "gasPrice": "40449366242",
    "isError": "0",
    "txreceipt_status": "1",
    "input": "0x23b872dd0000000000000000000000004ca43dc185ff11844e448604cd11409a92a3794b00000000000000000000000034380456f50e013f1b8b2b9b5dc9d55fb0ca9c2b00000000000000000000000000000042ffffffffffffffffffffffffffffff76",
    "contractAddress": "",
    "cumulativeGasUsed": "9370793",
    "gasUsed": "106123",
    "confirmations": "180974",
  },
  ...

I want to use Pandas dataframes to perform the following operations.

First of all, timeStamp should be converted into a suitable format. Then I need to group the data by the from value and a 1-week interval. Then, in each element of a group, I would like to add a new key (the name doesn't matter) with the value:

  • f"{from}_{timestamp_of_first_record_of_the_group}_{time_interval}

So, something like the following:

  • 0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e_2022-03-31T05:19:13_1W

I need this new key to identify the elements occurring in the same time interval to perform additional analysis with external tools.

So far I've tried the following:

import pandas as pd

df = pd.read_json(file_path)
df['timeStamp'] = pd.to_datetime(df['timeStamp'], unit='s')

df_grouped = df.groupby(['from', pd.Grouper(key="timeStamp", freq="1W")])

# from here I don't know how to apply the changes described above

CodePudding user response:

Here is a possible answer:

Data:

Here is some example data. I stripped the unnecessary columns for the sake of this answer.

     timeStamp     from
0   1648703953  0xaaaaa
1   1648779553  0xaaaaa
2   1648855153  0xaaaaa
3   1648930753  0xaaaaa
4   1649006353  0xaaaaa
5   1649081953  0xaaaaa
6   1649157553  0xaaaaa
7   1649233153  0xaaaaa
8   1649308753  0xaaaaa
9   1649384353  0xaaaaa
10  1649459953  0xaaaaa
11  1649535553  0xaaaaa
12  1649611153  0xaaaaa
13  1649686753  0xaaaaa
14  1649762353  0xaaaaa
15  1648703953  0xFFFFF
16  1648779553  0xFFFFF
17  1648855153  0xFFFFF
18  1648930753  0xFFFFF
19  1649006353  0xFFFFF
20  1649081953  0xFFFFF
21  1649157553  0xFFFFF
22  1649233153  0xFFFFF
23  1649308753  0xFFFFF
24  1649384353  0xFFFFF
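
For reproducibility, the example frame above can be built directly. Here is a minimal sketch (the timestamps are 21 hours apart, as in the table; the variable names are arbitrary):

import pandas as pd

# Two senders: 15 transactions for 0xaaaaa, 10 for 0xFFFFF, 21 hours apart.
timestamps = [1648703953 + i * 75600 for i in range(15)]
df = pd.DataFrame({
    'timeStamp': timestamps + timestamps[:10],
    'from': ['0xaaaaa'] * 15 + ['0xFFFFF'] * 10,
})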

Code:

# Timestamp to readable format.
df['timeStamp'] = pd.to_datetime(df['timeStamp'].astype(int), unit='s')

# Group by week.
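# ngroup() gives one global number per (from, week) pair; subtracting each
# sender's minimum afterwards renumbers the weeks from 0 for every address.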
df['week_id'] = df.groupby(['from', pd.Grouper(key='timeStamp', freq='1W-MON')]).ngroup()
df['week_id'] -= df.groupby('from')['week_id'].transform('min')

# Tag.
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)

print(df.drop(columns=['week_id', 'first_timeStamp']))

Result:

             timeStamp     from                            tag
0  2022-03-31 05:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
1  2022-04-01 02:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
2  2022-04-01 23:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
3  2022-04-02 20:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
4  2022-04-03 17:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
5  2022-04-04 14:19:13  0xaaaaa  0xaaaaa_2022-03-31T05:19:13_0
6  2022-04-05 11:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
7  2022-04-06 08:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
8  2022-04-07 05:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
9  2022-04-08 02:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
10 2022-04-08 23:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
11 2022-04-09 20:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
12 2022-04-10 17:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
13 2022-04-11 14:19:13  0xaaaaa  0xaaaaa_2022-04-05T11:19:13_1
14 2022-04-12 11:19:13  0xaaaaa  0xaaaaa_2022-04-12T11:19:13_2
15 2022-03-31 05:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
16 2022-04-01 02:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
17 2022-04-01 23:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
18 2022-04-02 20:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
19 2022-04-03 17:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
20 2022-04-04 14:19:13  0xFFFFF  0xFFFFF_2022-03-31T05:19:13_0
21 2022-04-05 11:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
22 2022-04-06 08:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
23 2022-04-07 05:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1
24 2022-04-08 02:19:13  0xFFFFF  0xFFFFF_2022-04-05T11:19:13_1

Note that the grouping uses calendar weeks (anchored on Monday via freq='1W-MON'), so the first week does not necessarily start at the first timestamp in the data.
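
If weeks should instead be counted from each sender's first transaction rather than by calendar weeks, a variant along these lines could be used (a sketch, not part of the answer above; it reuses the same df and column names):

# Number weeks relative to each sender's first transaction.
first_ts = df.groupby('from')['timeStamp'].transform('min')
df['week_id'] = (df['timeStamp'] - first_ts).dt.days // 7

# Rebuild the tag with the new week numbering.
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)

Either way, the tagged frame can then be exported for the external analysis mentioned in the question, e.g. with df.to_csv or df.to_json.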
