Home > Net >  Count and Sum of distinct values in python
Count and Sum of distinct values in python

Time:04-06

Basically the data set I have has 4 columns as below :-

UID Account Name Amount
w1 A1 Rohit 10
w2 A1 Rohit 10
w3 A2 Rohit 100
w4 B1 Sakshi 10
w5 B2 Sakshi 20
w6 B3 Sakshi 30

Now for each name I am trying to find using python

  1. count of distinct Accounts,
  2. sum of amount for those distinct accounts
  3. count of UIDs for each name

The output would look like something below

Name count of Account count of UID Sum of Amount
rohit 2 3 110
Sakshi 3 3 60

Till now I was able to get the counts using below snippet but couldn't calculate amount.

df = df.groupby('Name')['Account','Uid'].nunique()

CodePudding user response:

Using DataFrameGroupBy.apply, we can combine the nunique and sum operations into one function as shown below:

def f(group):
  # Define transformation dictionary
  transform = {}
  # Set Account to be number of unique accounts
  transform['Account'] = group['Account'].nunique()
  # Set Uid to be number of unique UIDs
  transform['Uid'] = group['Uid'].nunique()
  # Find names of the unique accounts
  unique_accounts = group['Account'].unique()
  # For each unique name, get the corresponding amount of the first matching row
  unique_amounts = [group[group['Account'] == u].iloc[0]['Amount'] for u in unique_accounts]
  # Set Amount to the sum of the unique amounts
  transform['Amount'] = sum(unique_amounts)
  # Return a new Series with our transformed data and its labels
  return pd.Series(transform, index=['Account', 'Uid', 'Amount'])


df = df.groupby('Name').apply(f)

This results in the following table:

        Account  UID  Amount
Name                        
Rohit         2    3     110
Sakshi        3    3      60

For more information on applying multiple functions to groupby columns, please see this post. Also, for information on obtaining just the first row of a dataframe that matches a certain criteria, see this post.

CodePudding user response:

At first glance, I would have suggested:

df.groupby('Name').agg({'Account':'count','UID':'count','Amount': 'sum'})

But as you pointed, that you want to deduplicate the Amount value per name before the sum, I guess I would do it in 2 steps (maybe there is a smarter way involving lambda functions though):

s1=df.drop_duplicates(subset=['Name','Amount']).groupby('Name')['Amount'].sum()
df1=df.groupby('Name')['Account','UID'].nunique()
df1.merge(s1.to_frame(), left_index=True, right_index=True)
  • Related