Home > front end >  How to obtain counts and sums for pairs of values in each row of Pandas DataFrame
How to obtain counts and sums for pairs of values in each row of Pandas DataFrame

Time:08-20

Problem:

I have a DataFrame like so:

import pandas as pd
df = pd.DataFrame({
    "name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category":["a","b","c","b","a","b","c","c","a","c"],
    "amount":[100,200,13,23,40,2,43,92,83,1]
})
    name    | category  | amount
    ----------------------------
0   john    | a         | 100
1   jim     | b         | 200
2   eric    | c         | 13
3   jim     | b         | 23
4   john    | a         | 40
5   jim     | b         | 2
6   jim     | c         | 43
7   eric    | c         | 92
8   eric    | a         | 83
9   john    | c         | 1

I would like to add two new columns: first; the total amount for the relevant category for the name of the row (eg: the value in row 0 would be 140, because john has a total of 100 40 of the a category). Second; the counts of those name and category combinations which are being summed in the first new column (eg: the row 0 value would be 2).

Desired output:

The output I'm looking for here looks like this:

    name    | category  | amount    | sum_for_category  | count_for_category
    ------------------------------------------------------------------------
0   john    | a         | 100       | 140               | 2
1   jim     | b         | 200       | 225               | 3
2   eric    | c         | 13        | 105               | 2
3   jim     | b         | 23        | 225               | 3
4   john    | a         | 40        | 140               | 2
5   jim     | b         | 2         | 225               | 3
6   jim     | c         | 43        | 43                | 1
7   eric    | c         | 92        | 105               | 2
8   eric    | a         | 83        | 83                | 1
9   john    | c         | 1         | 1                 | 1

I don't want to group the data by the features because I want to keep the same number of rows. I just want to tag on the desired value for each row.

Best I could do:

I can't find a good way to do this. The best I've been able to come up with is the following:

names = df["name"].unique()
categories = df["category"].unique()

sum_for_category = {i:{
    j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].sum() for j in categories
} for i in names}
df["sum_for_category"] = df.apply(lambda x: sum_for_category[x["name"]][x["category"]],axis=1)

count_for_category = {i:{
    j:df.loc[(df["name"]==i)&(df["category"]==j)]["amount"].count() for j in categories
} for i in names}
df["count_for_category"] = df.apply(lambda x: count_for_category[x["name"]][x["category"]],axis=1)

But this is extremely clunky and slow; far too slow to be viable on my actual dataset (roughly 700,000 rows x 10 columns). I'm sure there's a better and faster way to do this... Many thanks in advance.

CodePudding user response:

You need two groupby.transform:

g = df.groupby(['name', 'category'])['amount']
df['sum_for_category'] = g.transform('sum')
df['count_or_category'] = g.transform('size')

output:

   name category  amount  sum_for_category  count_or_category
0  john        a     100               140                  2
1   jim        b     200               225                  3
2  eric        c      13               105                  2
3   jim        b      23               225                  3
4  john        a      40               140                  2
5   jim        b       2               225                  3
6   jim        c      43                43                  1
7  eric        c      92               105                  2
8  eric        a      83                83                  1
9  john        c       1                 1                  1

CodePudding user response:

import pandas as pd
    df = pd.DataFrame({
    "name":["john","jim","eric","jim","john","jim","jim","eric","eric","john"],
    "category":["a","b","c","b","a","b","c","c","a","c"],
    "amount":[100,200,13,23,40,2,43,92,83,1]
})
    df_Count = 
df.groupby(['name','category']).count().reset_index().rename({'amount':'Count_For_Category'}, axis=1)
df_Sum = df.groupby(['name','category']).sum().reset_index().rename({'amount':'Sum_For_Category'},axis=1)
df_v2 = pd.merge(df,df_Count[['name','category','Count_For_Category']],  left_on=['name','category'], right_on=['name','category'], how='left')
df_v2 = pd.merge(df_v2,df_Sum[['name','category','Sum_For_Category']],  left_on=['name','category'], right_on=['name','category'], how='left')
df_v2

Hi There,

Use a simple code to easy understand, please try these code below, Just run it you will get what you want.

Thanks Leon

  • Related