Count and Sum of distinct values in python-CodePudding

Basically the data set I have has 4 columns as below :-

UID	Account	Name	Amount
w1	A1	Rohit	10
w2	A1	Rohit	10
w3	A2	Rohit	100
w4	B1	Sakshi	10
w5	B2	Sakshi	20
w6	B3	Sakshi	30

Now for each name I am trying to find using python

count of distinct Accounts,
sum of amount for those distinct accounts
count of UIDs for each name

The output would look like something below

Name	count of Account	count of UID	Sum of Amount
rohit	2	3	110
Sakshi	3	3	60

Till now I was able to get the counts using below snippet but couldn't calculate amount.

df = df.groupby('Name')['Account','Uid'].nunique()

CodePudding user response：

Using DataFrameGroupBy.apply, we can combine the nunique and sum operations into one function as shown below:

def f(group):
  # Define transformation dictionary
  transform = {}
  # Set Account to be number of unique accounts
  transform['Account'] = group['Account'].nunique()
  # Set Uid to be number of unique UIDs
  transform['Uid'] = group['Uid'].nunique()
  # Find names of the unique accounts
  unique_accounts = group['Account'].unique()
  # For each unique name, get the corresponding amount of the first matching row
  unique_amounts = [group[group['Account'] == u].iloc[0]['Amount'] for u in unique_accounts]
  # Set Amount to the sum of the unique amounts
  transform['Amount'] = sum(unique_amounts)
  # Return a new Series with our transformed data and its labels
  return pd.Series(transform, index=['Account', 'Uid', 'Amount'])


df = df.groupby('Name').apply(f)

This results in the following table:

        Account  UID  Amount
Name                        
Rohit         2    3     110
Sakshi        3    3      60

For more information on applying multiple functions to groupby columns, please see this post. Also, for information on obtaining just the first row of a dataframe that matches a certain criteria, see this post.

CodePudding user response：

At first glance, I would have suggested:

df.groupby('Name').agg({'Account':'count','UID':'count','Amount': 'sum'})

But as you pointed, that you want to deduplicate the Amount value per name before the sum, I guess I would do it in 2 steps (maybe there is a smarter way involving lambda functions though):

s1=df.drop_duplicates(subset=['Name','Amount']).groupby('Name')['Amount'].sum()
df1=df.groupby('Name')['Account','UID'].nunique()
df1.merge(s1.to_frame(), left_index=True, right_index=True)