I am trying to figure out a non-looping way to identify (auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain 1 or more rows) within each TDateID, GroupID combination.
Here is an example DataFrame that looks like
Index | Cents | SD_YF | TDateID | GroupID |
---|---|---|---|---|
10 | 182.5 | 2.1 | 0 | 0 |
11 | 182.5 | 2.1 | 0 | 0 |
12 | 153.5 | 1.05 | 0 | 1 |
13 | 153.5 | 1.05 | 0 | 1 |
14 | 43 | 11 | 1 | 2 |
15 | 43 | 11 | 1 | 2 |
4 | 152 | 21 | 1 | 2 |
5 | 152 | 21 | 1 | 2 |
My ideal output would be:
Index | Cents | SD_YF | TDateID | GroupID | UniID |
---|---|---|---|---|---|
10 | 182.5 | 2.1 | 0 | 0 | 1 |
11 | 182.5 | 2.1 | 0 | 0 | 2 |
12 | 153.5 | 1.05 | 0 | 1 | 3 |
13 | 153.5 | 1.05 | 0 | 1 | 4 |
14 | 43 | 11 | 1 | 2 | 5 |
15 | 43 | 11 | 1 | 2 | 6 |
4 | 152 | 21 | 1 | 2 | 5 |
5 | 152 | 21 | 1 | 2 | 6 |
I have bolded #5 to draw attention to how index 14, 4 are paired together. Similar with #6. I hope that makes sense!
CodePudding user response:
IIUC you need to add the group number the cumcount per duplicate 1:
df['UniID'] = (df['GroupID']
df.groupby('GroupID').ngroup().add(1)
df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
)
output:
Index Cents SD_YF GroupID UniID
0 10 182.5 2.10 0 1
1 11 182.5 2.10 0 2
2 12 153.5 1.05 1 3
3 13 153.5 1.05 1 4
4 14 43.0 11.00 2 5
5 15 43.0 11.00 2 6
6 4 152.0 21.00 2 5
7 5 152.0 21.00 2 6