How to count unique values in one column based on value in another column by group in Pandas

Time:11-23

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:

GroupID     ID       Value
ABC         TX123     0
ABC         TX678     1
ABC         TX678     2
DEF         AG123     1
DEF         AG123     1
DEF         AG123     1
GHI         TE203     0
GHI         TE203     0

I want to count the number of unique IDs per GroupID, but only for rows where the Value column is > 0. When all values for a GroupID are 0, the count should simply be 0. For example, the result dataset would look like this:

GroupID    UniqueNum
ABC           1
DEF           1
GHI           0

I've tried the following, but it simply returns the number of unique IDs regardless of Value. How do I add the condition Value > 0?

count_df = df.groupby(['GroupID'])['ID'].nunique()

CodePudding user response:

Positive counts only

You can pre-filter with loc and use named aggregation with groupby.agg(UniqueNum='nunique'):

(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)

Output:

  GroupID  UniqueNum
0     ABC          1
1     DEF          1
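Note that grouping the filtered Series by the unfiltered df['GroupID'] works because pandas aligns the grouper to the filtered Series' index. A self-contained sketch of this approach, using the sample data from the question:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI'],
    'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123', 'TE203', 'TE203'],
    'Value': [0, 1, 2, 1, 1, 1, 0, 0],
})

filtered = df.loc[df['Value'].gt(0), 'ID']   # only rows with Value > 0 survive
out = (filtered.groupby(df['GroupID'])       # grouper aligned to filtered index
               .agg(UniqueNum='nunique')
               .reset_index())
print(out)
```

GHI has no rows with Value > 0, so it is absent from this result; the reindex variant below restores it with a zero count.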

All counts (including zero)

If you want the groups with no match counted as zero, you can reindex:

(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reindex(df['GroupID'].unique(), fill_value=0)
   .reset_index()
)
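End to end, the reindex variant looks like this on the sample data (the reindex step re-adds the all-zero group with fill_value=0):

```python
import pandas as pd

df = pd.DataFrame({
    'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI'],
    'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123', 'TE203', 'TE203'],
    'Value': [0, 1, 2, 1, 1, 1, 0, 0],
})

out = (df.loc[df['Value'].gt(0), 'ID']
         .groupby(df['GroupID'])
         .agg(UniqueNum='nunique')
         # re-add groups dropped by the filter, with a count of 0
         .reindex(df['GroupID'].unique(), fill_value=0)
         .reset_index())
print(out)
```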

Or mask the non-matching values with where (nunique ignores the resulting NaNs):

(df['ID'].where(df['Value'].gt(0))
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)

Output:

  GroupID  UniqueNum
0     ABC          1
1     DEF          1
2     GHI          0
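For readers who find the index-aligned mask less intuitive, the same logic can be written less vectorized with groupby.apply, filtering inside each group (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'DEF', 'GHI', 'GHI'],
    'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123', 'TE203', 'TE203'],
    'Value': [0, 1, 2, 1, 1, 1, 0, 0],
})

# count unique IDs with Value > 0 within each group; an empty
# selection (e.g. GHI) gives nunique() == 0 automatically
out = (df.groupby('GroupID')[['ID', 'Value']]
         .apply(lambda g: g.loc[g['Value'].gt(0), 'ID'].nunique())
         .reset_index(name='UniqueNum'))
print(out)
```

This is slower on large frames than the vectorized mask, but keeps the condition and the count in one place.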

Used input:

GroupID     ID       Value
ABC         TX123     0
ABC         TX678     1
ABC         TX678     2
DEF         AG123     1
DEF         AG123     1
DEF         AG123     1
GHI         AB123     0

CodePudding user response:

If you need 0 for groups with no matching values, use Series.where to set NaN where the condition fails, then aggregate with SeriesGroupBy.nunique:

df = pd.DataFrame({ 'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                    'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                    'Value': [0, 1, 2, 1, 1, 0]})

df = (df['ID'].where(df["Value"].gt(0)).groupby(df['GroupID'])
              .nunique()
              .reset_index(name='nunique'))

print (df)
  GroupID  nunique
0     ABC        1
1     DEF        1
2     NEW        0

How it works:

print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
  GroupID     ID  Value    new
0     ABC  TX123      0    NaN
1     ABC  TX678      1  TX678
2     ABC  TX678      2  TX678
3     DEF  AG123      1  AG123
4     DEF  AG123      1  AG123
5     NEW  AG123      0    NaN <- NaN set where the condition is not matched
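The key point is that nunique excludes NaN by default (dropna=True), so the rows masked by where contribute nothing to the count; a minimal demonstration:

```python
import pandas as pd

# Series.nunique excludes NaN by default (dropna=True),
# so rows masked to NaN by where() drop out of the count
s = pd.Series(['TX678', None, 'TX678'])
print(s.nunique())        # 1

# a group whose values were all masked to NaN therefore yields 0
all_nan = pd.Series([None, None], dtype=object)
print(all_nan.nunique())  # 0
```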