How to count unique values in one column based on the value in another column, by group, in Pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:

GroupID     ID       Value
ABC         TX123     0
ABC         TX678     1
ABC         TX678     2
DEF         AG123     1
DEF         AG123     1
DEF         AG123     1
GHI         TE203     0
GHI         TE203     0

I want to count the number of unique IDs per GroupID, but only for rows where Value > 0. A group whose values are all 0 should simply get a count of 0. For example, the result dataset would look like this:

GroupID    UniqueNum
ABC           1
DEF           1
GHI           0

I've tried the following, but it simply returns the number of unique IDs regardless of Value. How do I add the condition Value > 0?

count_df = df.groupby(['GroupID'])['ID'].nunique()

CodePudding user response：

positive counts only

You can pre-filter the rows with loc, then use named aggregation with groupby.agg and 'nunique'. Groups without any positive value (here GHI) drop out of the result:

(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)

Output:

  GroupID  UniqueNum
0     ABC          1
1     DEF          1
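
For anyone who wants to run the pre-filtering approach end to end, here is a minimal sketch. It assumes the sample table from the question has been typed in by hand as a DataFrame; the names positive_ids and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "GroupID": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "GHI", "GHI"],
    "ID": ["TX123", "TX678", "TX678", "AG123", "AG123", "AG123", "TE203", "TE203"],
    "Value": [0, 1, 2, 1, 1, 1, 0, 0],
})

# Keep only the IDs whose Value is positive.
positive_ids = df.loc[df["Value"].gt(0), "ID"]

# Group the remaining IDs by GroupID (aligned on the index) and count distinct IDs.
result = (
    positive_ids
    .groupby(df["GroupID"])
    .agg(UniqueNum="nunique")
    .reset_index()
)

# GHI has no row left after filtering, so it is absent from the result.
print(result)
```

Because the filter runs before the groupby, a group with no positive Value never reaches the aggregation, which is why it has to be added back separately if a zero count is wanted.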

all counts (including zero)

If you want groups with no match to be reported with a count of zero, you can reindex on all GroupID values:

(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reindex(df['GroupID'].unique(), fill_value=0)
   .reset_index()
)

Or mask the non-matching IDs with where, so that every group is kept:

(df['ID'].where(df['Value'].gt(0))
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)

Output:

  GroupID  UniqueNum
0     ABC          1
1     DEF          1
2     GHI          0

Used input:

GroupID     ID       Value
ABC         TX123     0
ABC         TX678     1
ABC         TX678     2
DEF         AG123     1
DEF         AG123     1
DEF         AG123     1
GHI         AB123     0
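
To check that the two zero-preserving variants agree, the following sketch rebuilds the "Used input" table as a DataFrame and runs both. The names is_positive, via_reindex and via_where are illustrative only, and this is one possible way to compare the variants rather than part of the original answer.

```python
import pandas as pd

# The "Used input" table above, rebuilt as a DataFrame.
df = pd.DataFrame({
    "GroupID": ["ABC", "ABC", "ABC", "DEF", "DEF", "DEF", "GHI"],
    "ID": ["TX123", "TX678", "TX678", "AG123", "AG123", "AG123", "AB123"],
    "Value": [0, 1, 2, 1, 1, 1, 0],
})
is_positive = df["Value"].gt(0)

# Variant 1: filter first, then add the missing groups back with a count of 0.
via_reindex = (
    df.loc[is_positive, "ID"]
    .groupby(df["GroupID"])
    .agg(UniqueNum="nunique")
    .reindex(df["GroupID"].unique(), fill_value=0)
    .reset_index()
)

# Variant 2: keep every row but replace non-matching IDs with NaN,
# which nunique skips by default.
via_where = (
    df["ID"].where(is_positive)
    .groupby(df["GroupID"])
    .agg(UniqueNum="nunique")
    .reset_index()
)

print(via_reindex)
print(via_where)
```

The where variant is shorter because no group is ever removed. The reindex variant may be easier to follow if the filtered data is also needed for other steps.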

CodePudding user response：

If you need 0 for groups with no match, use Series.where to replace the IDs that fail the condition with NaN, then aggregate with SeriesGroupBy.nunique, which ignores NaN by default:

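import pandas as pd

df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})

# Mask IDs whose Value is not positive, then count distinct IDs per group.
# The result gets its own name so that df stays available for the check below.
out = (df['ID'].where(df['Value'].gt(0))
               .groupby(df['GroupID'])
               .nunique()
               .reset_index(name='nunique'))

print (out)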
  GroupID  nunique
0     ABC        1
1     DEF        1
2     NEW        0

How it works:

print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
  GroupID     ID  Value    new
0     ABC  TX123      0    NaN
1     ABC  TX678      1  TX678
2     ABC  TX678      2  TX678
3     DEF  AG123      1  AG123
4     DEF  AG123      1  AG123
5     NEW  AG123      0    NaN <- NaN where the condition is not met
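
The NaN values in the new column matter because nunique skips missing values unless told otherwise. The small sketch below illustrates this with a hypothetical four-row miniature of the masked column; the names masked, groups, default and with_nan are made up for the example.

```python
import pandas as pd

# Hypothetical miniature of the masked column: missing where Value was not positive.
masked = pd.Series([None, "TX678", "TX678", None])
groups = pd.Series(["ABC", "ABC", "ABC", "NEW"])

# Default behaviour (dropna=True): missing values are not counted,
# so a group that contains only missing values gets 0.
default = masked.groupby(groups).nunique()

# With dropna=False, a missing value counts as one more distinct value per group.
with_nan = masked.groupby(groups).nunique(dropna=False)

print(default)
print(with_nan)
```

This is why the where approach returns 0 for a group such as NEW without needing a reindex step.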