Python: Group and count unique variables based on multiple grouping without recount


I have a DataFrame with 3 columns:

ID  Round  Investor
X   1      A
X   2      A
X   2      B
X   3      A
Y   1      A
Y   1      B
Y   1      C
Y   2      B
Y   2      D

I want to count the number of unique investors for each round for each ID, but I don't want an investor to be recounted if they already appeared in a previous round. The code I am using is:

print(df.groupby(['ID', 'Round'])['Investor'].nunique())

Which results in:

ID  Round  Unique Investor
X   1      1
    2      2
    2      2
    3      1
Y   1      3
    1      3
    1      3
    2      2
    2      2

But I don't want it to count an investor who has already invested in an earlier round for the same ID:

ID  Round  Unique Investor
X   1      1
    2      1
    2      1
    3      0
Y   1      3
    1      3
    1      3
    2      1
    2      1

Any help is greatly appreciated!
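For reference, the sample frame above can be rebuilt and the current approach reproduced as follows (the variable name `counts` is my own):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'ID':       ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
    'Round':    [1, 2, 2, 3, 1, 1, 1, 2, 2],
    'Investor': ['A', 'A', 'B', 'A', 'A', 'B', 'C', 'B', 'D'],
})

# Counts unique investors per (ID, Round) -- but recounts
# investors already seen in earlier rounds of the same ID
counts = df.groupby(['ID', 'Round'])['Investor'].nunique()
print(counts)
```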

CodePudding user response:

You can define a helper column Investor2 by grouping on ID and dropping duplicate investors within each ID with Series.drop_duplicates. After realignment, repeated investors become NaN and are therefore not counted.

Then, group by ID and Round as you did before, but on this Investor2 column, with .transform() and nunique, as follows:

df['Unique Investor'] = (
    df.assign(Investor2=df.groupby('ID')['Investor'].apply(pd.Series.drop_duplicates).droplevel(0))
      .groupby(['ID', 'Round'])['Investor2'].transform('nunique')
    )

Result:

print(df)

  ID  Round Investor  Unique Investor
0  X      1        A                1
1  X      2        A                1
2  X      2        B                1
3  X      3        A                0
4  Y      1        A                3
5  Y      1        B                3
6  Y      1        C                3
7  Y      2        B                1
8  Y      2        D                1
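A self-contained sketch of this answer, with the sample data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':       ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
    'Round':    [1, 2, 2, 3, 1, 1, 1, 2, 2],
    'Investor': ['A', 'A', 'B', 'A', 'A', 'B', 'C', 'B', 'D'],
})

# Keep only the first occurrence of each investor within an ID;
# droplevel(0) restores the original index so assign() realigns,
# leaving NaN on rows where the investor was already seen
df['Unique Investor'] = (
    df.assign(Investor2=df.groupby('ID')['Investor']
                          .apply(pd.Series.drop_duplicates)
                          .droplevel(0))
      .groupby(['ID', 'Round'])['Investor2']
      .transform('nunique')
)
print(df)
```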

CodePudding user response:

One possible solution is to drop duplicates based on ID and Investor, group by ID and Round to get the number of uniques, and merge the result back onto the main dataframe:

dups = ['ID', 'Investor']

group = ['ID', 'Round']

mapping = (df.drop_duplicates(subset = dups)
             .groupby(group)
             .Investor
             .nunique()
          )

(df.filter(group)
   .merge(mapping, left_on = group, 
          right_index = True, how = 'left')
   .fillna(0, downcast='infer')
)

  ID  Round  Investor
0  X      1         1
1  X      2         1
2  X      2         1
3  X      3         0
4  Y      1         3
5  Y      1         3
6  Y      1         3
7  Y      2         1
8  Y      2         1
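Putting that answer together as a runnable sketch (sample data rebuilt from the question; `out` is my own name, and I cast with astype rather than the deprecated downcast argument of fillna):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':       ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
    'Round':    [1, 2, 2, 3, 1, 1, 1, 2, 2],
    'Investor': ['A', 'A', 'B', 'A', 'A', 'B', 'C', 'B', 'D'],
})

dups = ['ID', 'Investor']
group = ['ID', 'Round']

# Keep each investor's first occurrence per ID, then count
# uniques per (ID, Round)
mapping = (df.drop_duplicates(subset=dups)
             .groupby(group)['Investor']
             .nunique())

# Merge back; rounds with no first-time investors get NaN -> 0
out = (df.filter(group)
         .merge(mapping, left_on=group, right_index=True, how='left')
         .fillna(0))
out['Investor'] = out['Investor'].astype(int)
print(out)
```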