How to obtain counts and sums for pairs of values in each row of Pandas DataFrame-CodePudding

Problem:

I have a DataFrame like so:

import pandas as pd
df = pd.DataFrame({
    "name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category":["a","b","c","b","a","b","c","c","a","c"],
    "amount":[100,200,13,23,40,2,43,92,83,1]
})

    name    | category  | amount
    ----------------------------
0   john    | a         | 100
1   jim     | b         | 200
2   eric    | c         | 13
3   jim     | b         | 23
4   john    | a         | 40
5   jim     | b         | 2
6   jim     | c         | 43
7   eric    | c         | 92
8   eric    | a         | 83
9   john    | c         | 1

I would like to add two new columns: first; the total amount for the relevant category for the name of the row (eg: the value in row 0 would be 140, because john has a total of 100 40 of the a category). Second; the counts of those name and category combinations which are being summed in the first new column (eg: the row 0 value would be 2).

Desired output:

The output I'm looking for here looks like this:

    name    | category  | amount    | sum_for_category  | count_for_category
    ------------------------------------------------------------------------
0   john    | a         | 100       | 140               | 2
1   jim     | b         | 200       | 225               | 3
2   eric    | c         | 13        | 105               | 2
3   jim     | b         | 23        | 225               | 3
4   john    | a         | 40        | 140               | 2
5   jim     | b         | 2         | 225               | 3
6   jim     | c         | 43        | 43                | 1
7   eric    | c         | 92        | 105               | 2
8   eric    | a         | 83        | 83                | 1
9   john    | c         | 1         | 1                 | 1

I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.

Best I could do:

I can't find a good way to do this. The best I've been able to come up with is the following:

names = df["name"].unique()
categories = df["category"].unique()

sum_for_category = {i:{
    j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]],axis=1)

count_for_category = {i:{
    j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]],axis=1)

But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.

CodePudding user response：

You need two groupby.transform:

g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_or_category'] = g.transform('size')

output:

   name category  amount  sum_for_category  count_or_category
0  john        a     100               140                  2
1   jim        b     200               225                  3
2  eric        c      13               105                  2
3   jim        b      23               225                  3
4  john        a      40               140                  2
5   jim        b       2               225                  3
6   jim        c      43                43                  1
7  eric        c      92               105                  2
8  eric        a      83                83                  1
9  john        c       1                 1                  1

CodePudding user response：

import pandas as pd
    df = pd.DataFrame({
    "name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category":["a","b","c","b","a","b","c","c","a","c"],
    "amount":[100,200,13,23,40,2,43,92,83,1]
})
    df_Count = 
df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'},axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']],  left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']],  left_on=['name','category'], right_on=['name','category'], how='left')
df_v2

Hi There,

Use a simple code to easy understand, please try these code below, Just run it you will get what you want.

Thanks Leon