Restarting count column when different row appears in Python/Pandas-CodePudding

I have the following Pandas DataFrame with multiple amounts for each address. The count of 'amounts' per address varies.

Index                                           type    amount
0   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow 2.553800e 04
1   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  1.999650e 05
2   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  3.896400e 04
3   0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F  inflow  3.060000e 05
4   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.569367e 05
5   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.219739e 04

I want to create a column that would count from 1 to n rows per address, but I don't know how to restart the count at the next address.

Something that would look like this:

Index                                           type    amount          Epoch
0   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow 2.553800e 04    1
1   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  1.999650e 05    2
2   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  3.896400e 04    3
3   0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F  inflow  3.060000e 05    1
4   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.569367e 05    1
5   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.219739e 04    2

As you can see, the count of epochs restarts when a row with a new address appears.

How can I create a logic for that column for any given number of addresses/rows?

Additionally: is there anything I should watch out for when structuring the DataFrame? E.G., always grouping equal addresses and not having them appear in random places in the DataFrame.

CodePudding user response：

Use groupby with cumcount:

If you want the count to continue if an address re-occurs later:

df["Epoch"] = df.groupby("Index").cumcount() 1

>>> df
                                        Index     type     amount  Epoch
0  0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow   25538.00      1
1  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      2
2  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      3
3  0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F   inflow  306000.00      1
4  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow  156936.70      1
5  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow   12197.39      2
6  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      4
7  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      5

If you want the count to re-start at 1 for the address that re-occurs:

df["Epoch"] = df.groupby((df["Index"]!=df["Index"].shift()).cumsum()).cumcount() 1

>>> df
                                        Index     type     amount  Epoch
0  0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow   25538.00      1
1  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      2
2  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      3
3  0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F   inflow  306000.00      1
4  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow  156936.70      1
5  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow   12197.39      2
6  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      1
7  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      2

Note the difference in output in the last two rows. I copied the second and third rows of your example to the end of the DataFrame to illustrate the difference in the two methods.