I have a JSON file of Ethereum transactions with the following structure:
...,
{
"blockNumber": "14492022",
"timeStamp": "1648703953",
"hash": "0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e",
"nonce": "89",
"blockHash": "0xa804b6c72753657275c58b00bf17cd2ca7e2be3cbf4a6615f4cc7175c3c76aea",
"transactionIndex": "162",
"from": "0x4ca43dc185ff11844e448604cd11409a92a3794b",
"to": "0xf87e31492faf9a91b02ee0deaad50d51d56d5d4d",
"value": "0",
"gas": "130000",
"gasPrice": "40449366242",
"isError": "0",
"txreceipt_status": "1",
"input": "0x23b872dd0000000000000000000000004ca43dc185ff11844e448604cd11409a92a3794b00000000000000000000000034380456f50e013f1b8b2b9b5dc9d55fb0ca9c2b00000000000000000000000000000042ffffffffffffffffffffffffffffff76",
"contractAddress": "",
"cumulativeGasUsed": "9370793",
"gasUsed": "106123",
"confirmations": "180974",
},
...
I want to use pandas DataFrames to perform the following operations. First of all, timeStamp should be converted into a suitable datetime format. Then I need to group the data by the from value and an interval of 1 week. Then, in each element of a group, I would like to add a new key (the name doesn't matter) with the value:
f"{from}_{timestamp_of_first_record_of_the_group}_{time_interval}"
So, something like the following:
0xdc15c50f4532ec385a3747f2a0e646922a395f6aa574794a14d07d8219ddea3e_2022-03-31T05:19:13_1W
I need this new key to identify the elements occurring in the same time interval to perform additional analysis with external tools.
So far I've tried the following:
import pandas as pd

df = pd.read_json(file_path)
df['timeStamp'] = pd.to_datetime(df['timeStamp'], unit='s')
df_grouped = df.groupby(['from', pd.Grouper(key="timeStamp", freq="1W")])
# from here I don't know how to apply the changes described above
CodePudding user response:
Here is a possible answer:
Data:
Here is some example data. I stripped the unnecessary columns for the sake of this answer.
timeStamp from
0 1648703953 0xaaaaa
1 1648779553 0xaaaaa
2 1648855153 0xaaaaa
3 1648930753 0xaaaaa
4 1649006353 0xaaaaa
5 1649081953 0xaaaaa
6 1649157553 0xaaaaa
7 1649233153 0xaaaaa
8 1649308753 0xaaaaa
9 1649384353 0xaaaaa
10 1649459953 0xaaaaa
11 1649535553 0xaaaaa
12 1649611153 0xaaaaa
13 1649686753 0xaaaaa
14 1649762353 0xaaaaa
15 1648703953 0xFFFFF
16 1648779553 0xFFFFF
17 1648855153 0xFFFFF
18 1648930753 0xFFFFF
19 1649006353 0xFFFFF
20 1649081953 0xFFFFF
21 1649157553 0xFFFFF
22 1649233153 0xFFFFF
23 1649308753 0xFFFFF
24 1649384353 0xFFFFF
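If you want to follow along, this example frame can be rebuilt directly; the timestamps in the table start at 1648703953 and are spaced 75600 seconds (21 hours) apart. This snippet only reproduces the toy data above, not your real file:

import pandas as pd

# Example data only: 15 rows for 0xaaaaa, 10 rows for 0xFFFFF,
# timestamps 75600 seconds apart starting at 1648703953.
df = pd.DataFrame({
    'timeStamp': [1648703953 + i * 75600 for i in range(15)]
                 + [1648703953 + i * 75600 for i in range(10)],
    'from': ['0xaaaaa'] * 15 + ['0xFFFFF'] * 10,
})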
Code:
# Timestamp to readable format.
df['timeStamp'] = pd.to_datetime(df['timeStamp'].astype(int), unit='s')
# Group by sender and calendar week; ngroup() numbers every (from, week) pair,
# then subtract each sender's minimum so week ids start at 0 per sender.
df['week_id'] = df.groupby(['from', pd.Grouper(key='timeStamp', freq='1W-MON')]).ngroup()
df['week_id'] -= df.groupby('from')['week_id'].transform('min')
# Tag: earliest timestamp of each (from, week) group plus the per-sender week id.
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)
print(df.drop(columns=['week_id', 'first_timeStamp']))
Result:
timeStamp from tag
0 2022-03-31 05:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
1 2022-04-01 02:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
2 2022-04-01 23:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
3 2022-04-02 20:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
4 2022-04-03 17:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
5 2022-04-04 14:19:13 0xaaaaa 0xaaaaa_2022-03-31T05:19:13_0
6 2022-04-05 11:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
7 2022-04-06 08:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
8 2022-04-07 05:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
9 2022-04-08 02:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
10 2022-04-08 23:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
11 2022-04-09 20:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
12 2022-04-10 17:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
13 2022-04-11 14:19:13 0xaaaaa 0xaaaaa_2022-04-05T11:19:13_1
14 2022-04-12 11:19:13 0xaaaaa 0xaaaaa_2022-04-12T11:19:13_2
15 2022-03-31 05:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
16 2022-04-01 02:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
17 2022-04-01 23:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
18 2022-04-02 20:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
19 2022-04-03 17:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
20 2022-04-04 14:19:13 0xFFFFF 0xFFFFF_2022-03-31T05:19:13_0
21 2022-04-05 11:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
22 2022-04-06 08:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
23 2022-04-07 05:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
24 2022-04-08 02:19:13 0xFFFFF 0xFFFFF_2022-04-05T11:19:13_1
Note that the grouping uses fixed calendar weeks ending on a Monday (freq='1W-MON'), so the first week does not necessarily start at the first timestamp in the data.
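If you would rather anchor each interval to the sender's first transaction (so that week 0 always starts exactly at the first record of that from address), one possible variation is to bucket the time offsets manually. This is only a sketch, tested against nothing beyond the example data above:

first = df.groupby('from')['timeStamp'].transform('min')
# Whole weeks elapsed since the sender's first transaction.
df['week_id'] = (df['timeStamp'] - first).dt.days // 7
df['first_timeStamp'] = df.groupby(['from', 'week_id'])['timeStamp'].transform('min')
df['tag'] = df['from'] + '_' + df['first_timeStamp'].dt.strftime('%Y-%m-%dT%H:%M:%S') + '_' + df['week_id'].astype(str)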