I have the following Pandas DataFrame with multiple amounts for each address. The count of 'amounts' per address varies.
Index type amount
0 0xd81c0B4FEA284c908C5700187a67698b416a6bcc outflow 2.553800e 04
1 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 1.999650e 05
2 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 3.896400e 04
3 0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F inflow 3.060000e 05
4 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 1.569367e 05
5 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 1.219739e 04
I want to create a column that would count from 1 to n rows per address, but I don't know how to restart the count at the next address.
Something that would look like this:
Index type amount Epoch
0 0xd81c0B4FEA284c908C5700187a67698b416a6bcc outflow 2.553800e 04 1
1 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 1.999650e 05 2
2 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 3.896400e 04 3
3 0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F inflow 3.060000e 05 1
4 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 1.569367e 05 1
5 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 1.219739e 04 2
As you can see, the count of epochs restarts when a row with a new address appears.
How can I create a logic for that column for any given number of addresses/rows?
Additionally: is there anything I should watch out for when structuring the DataFrame? E.G., always grouping equal addresses and not having them appear in random places in the DataFrame.
CodePudding user response:
Use groupby
with cumcount
:
- If you want the count to continue if an address re-occurs later:
df["Epoch"] = df.groupby("Index").cumcount() 1
>>> df
Index type amount Epoch
0 0xd81c0B4FEA284c908C5700187a67698b416a6bcc outflow 25538.00 1
1 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 199965.00 2
2 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 38964.00 3
3 0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F inflow 306000.00 1
4 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 156936.70 1
5 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 12197.39 2
6 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 199965.00 4
7 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 38964.00 5
- If you want the count to re-start at 1 for the address that re-occurs:
df["Epoch"] = df.groupby((df["Index"]!=df["Index"].shift()).cumsum()).cumcount() 1
>>> df
Index type amount Epoch
0 0xd81c0B4FEA284c908C5700187a67698b416a6bcc outflow 25538.00 1
1 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 199965.00 2
2 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 38964.00 3
3 0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F inflow 306000.00 1
4 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 156936.70 1
5 0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af outflow 12197.39 2
6 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 199965.00 1
7 0xd81c0B4FEA284c908C5700187a67698b416a6bcc inflow 38964.00 2
Note the difference in output in the last two rows. I copied the second and third rows of your example to the end of the DataFrame to illustrate the difference in the two methods.