I am working in Python on a pandas DataFrame and am trying to count the unique values of a column within groups. My problem is that the count needs to be cumulative, covering a steadily increasing number of rows within each group, and NaNs should not be counted.
Simplified, the data looks like this:
ID occup
1 NaN
1 A
1 NaN
1 NaN
1 B
2 K
2 NaN
2 L
2 L
2 M
The new column 'occupcount' should count the number of unique values in 'occup' within the groups defined by 'ID', but cumulatively: in the first row of each group the count should consider only that first row, in the second row it should cover the first two rows, and in the fifth row it should cover all five rows of the group. It should look like this:
ID occup occupcount
1 NaN 0
1 A 1
1 NaN 1
1 B 2
1 A 2
2 K 1
2 NaN 1
2 L 2
2 K 2
2 M 3
I tried to solve the task with something like:
df['occupcount'] = (df.groupby(["ID"])['occup'].transform('nunique'))
But this only gives the total number of unique values over all rows within each group, with no gradual increase. Thanks in advance!
CodePudding user response:
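The idea is to build a boolean mask that is True only for the first occurrence of each ('ID', 'occup') pair where 'occup' is not missing (DataFrame.duplicated on both columns, inverted, combined with notna), and then apply GroupBy.cumsum to that mask within each 'ID':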
df['occupcount'] = ((~df.duplicated(['ID','occup']) & df['occup'].notna())
                    .groupby(df['ID'])
                    .cumsum())
print(df)
   ID occup  occupcount
0   1   NaN           0
1   1     A           1
2   1   NaN           1
3   1     B           2
4   1     A           2
5   2     K           1
6   2   NaN           1
7   2     L           2
8   2     L           2
9   2     M           3
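For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same masking approach. The sample frame is rebuilt from the printed result above, and the intermediate names `first_seen` and `not_missing` are only illustrative, not part of the original answer. The `astype(int)` call is an extra precaution so the cumulative sum is an integer regardless of how a given pandas version handles boolean input.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the printed result above
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "occup": [np.nan, "A", np.nan, "B", "A", "K", np.nan, "L", "L", "M"],
})

# True only for the first occurrence of each (ID, occup) pair
first_seen = ~df.duplicated(["ID", "occup"])

# True where occup is not missing, so NaN never adds to the count
not_missing = df["occup"].notna()

# Each row that introduces a new non-missing value contributes 1;
# the cumulative sum within each ID gives the running unique count
new_value = (first_seen & not_missing).astype(int)
df["occupcount"] = new_value.groupby(df["ID"]).cumsum()

print(df)
```

Splitting the mask into two named steps makes it easier to inspect which rows are treated as "new" values before the cumulative sum is taken.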