Home > OS >  Pandas - Cumulative Count with Labeling
Pandas - Cumulative Count with Labeling

Time:11-17

I have a pandas df that looks like the following:

 --------- --------- ------------ -------- 
| Cluster | Country | Publishers | Assets |
 --------- --------- ------------ -------- 
| South   | IT      | SS         | Asset1 |
| South   | IT      | SS         | Asset2 |
| South   | IT      | SS         | Asset3 |
| South   | IT      | ML         | Asset1 |
| South   | IT      | ML         | Asset2 |
| South   | IT      | ML         | Asset3 |
| South   | IT      | TT         | Asset1 |
| South   | IT      | TT         | Asset2 |
| South   | IT      | TT         | Asset3 |
| South   | ES      | SS         | Asset1 |
| South   | ES      | SS         | Asset2 |
 --------- --------- ------------ -------- 

I would like to create a new column "Package" that uses a cumulative count based on the following columns:

  • Publishers
  • Assets

The result would be this:

 --------- --------- ------------ -------- --------- 
| Cluster | Country | Publishers | Assets | Package |
 --------- --------- ------------ -------- --------- 
| South   | IT      | SS         | Asset1 | 1       |
| South   | IT      | SS         | Asset2 | 1a      |
| South   | IT      | SS         | Asset3 | 1b      |
| South   | IT      | ML         | Asset1 | 2       |
| South   | IT      | ML         | Asset2 | 2a      |
| South   | IT      | ML         | Asset3 | 2b      |
| South   | IT      | TT         | Asset1 | 3       |
| South   | IT      | TT         | Asset2 | 3a      |
| South   | IT      | TT         | Asset3 | 3b      |
| South   | ES      | SS         | Asset1 | 4       |
| South   | ES      | SS         | Asset2 | 4a      |
 --------- --------- ------------ -------- --------- 

So far I have tried

df['Package'] = df.groupby(['Cluster','Publishers']).cumcount() but it seems not to work as the value resets to 0 after every publisher instance is gone through.

CodePudding user response:

You can use groupby.cumcount, but with a different grouper. You will also need the related groupby.ngroup:

from string import ascii_lowercase

# group by consecutive identical values
group = df['Publishers'].ne(df['Publishers'].shift()).cumsum()
# alternatively, you can also group by Cluster/Country/Publishers
# group = ['Cluster', 'Country', 'Publisher']


df['Package'] =(
  df.groupby(group).ngroup().add(1).astype(str)
  df.groupby(group).cumcount().map(dict(enumerate([''] list(ascii_lowercase))))
)

output:

   Cluster Country Publishers  Assets Package
0    South      IT         SS  Asset1       1
1    South      IT         SS  Asset2      1a
2    South      IT         SS  Asset3      1b
3    South      IT         ML  Asset1       2
4    South      IT         ML  Asset2      2a
5    South      IT         ML  Asset3      2b
6    South      IT         TT  Asset1       3
7    South      IT         TT  Asset2      3a
8    South      IT         TT  Asset3      3b
9    South      ES         SS  Asset1       4
10   South      ES         SS  Asset2      4a
  • Related