Basically the data set I have has 4 columns as below :-
UID | Account | Name | Amount |
---|---|---|---|
w1 | A1 | Rohit | 10 |
w2 | A1 | Rohit | 10 |
w3 | A2 | Rohit | 100 |
w4 | B1 | Sakshi | 10 |
w5 | B2 | Sakshi | 20 |
w6 | B3 | Sakshi | 30 |
Now for each name I am trying to find using python
- count of distinct Accounts,
- sum of amount for those distinct accounts
- count of UIDs for each name
The output would look like something below
Name | count of Account | count of UID | Sum of Amount |
---|---|---|---|
rohit | 2 | 3 | 110 |
Sakshi | 3 | 3 | 60 |
Till now I was able to get the counts using below snippet but couldn't calculate amount.
df = df.groupby('Name')['Account','Uid'].nunique()
CodePudding user response:
Using DataFrameGroupBy.apply
, we can combine the nunique
and sum
operations into one function as shown below:
def f(group):
# Define transformation dictionary
transform = {}
# Set Account to be number of unique accounts
transform['Account'] = group['Account'].nunique()
# Set Uid to be number of unique UIDs
transform['Uid'] = group['Uid'].nunique()
# Find names of the unique accounts
unique_accounts = group['Account'].unique()
# For each unique name, get the corresponding amount of the first matching row
unique_amounts = [group[group['Account'] == u].iloc[0]['Amount'] for u in unique_accounts]
# Set Amount to the sum of the unique amounts
transform['Amount'] = sum(unique_amounts)
# Return a new Series with our transformed data and its labels
return pd.Series(transform, index=['Account', 'Uid', 'Amount'])
df = df.groupby('Name').apply(f)
This results in the following table:
Account UID Amount
Name
Rohit 2 3 110
Sakshi 3 3 60
For more information on applying multiple functions to groupby columns, please see this post. Also, for information on obtaining just the first row of a dataframe that matches a certain criteria, see this post.
CodePudding user response:
At first glance, I would have suggested:
df.groupby('Name').agg({'Account':'count','UID':'count','Amount': 'sum'})
But as you pointed, that you want to deduplicate the Amount value per name before the sum, I guess I would do it in 2 steps (maybe there is a smarter way involving lambda functions though):
s1=df.drop_duplicates(subset=['Name','Amount']).groupby('Name')['Amount'].sum()
df1=df.groupby('Name')['Account','UID'].nunique()
df1.merge(s1.to_frame(), left_index=True, right_index=True)