I have the following df:
wallet position position_rewards position_type token_transfers total_position_rewards rewards_pct transfers_per_rewards
0 0x123 FRAX 700 LD 450 1000 0.7 315
0 0x123 USDC 200 LD 450 1000 0.2 90
0 0x123 ETH 100 Sushi_LP 450 1000 0.1 45
1 0xabc FXS 40 LD 5 50 0.8 4
1 0xabc ETH 10 Sushi_LP 5 50 0.2 1
A few notes about the data:
- The values in the columns token_transfers and total_position_rewards are "aggregated" by wallet. That is, each wallet has only one value in each of those columns, even though the wallet is repeated throughout the df (because other values in the other columns differ).
What I want to output:
I am interested in creating a table/chart showing the amount of token_transfers per position_type. One wallet can have different positions, but the position_type can be the same (see wallet 0x123).
total_position_rewards is the sum of all position_rewards for a given wallet, and rewards_pct is simply position_rewards divided by total_position_rewards.
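For context, these two derived columns can be reproduced with a grouped transform (a sketch, assuming the column names from the table above):

```python
import io
import pandas as pd

# Rebuild the sample frame from the question (derived columns omitted)
text = '''wallet position position_rewards position_type token_transfers
0x123 FRAX 700 LD 450
0x123 USDC 200 LD 450
0x123 ETH 100 Sushi_LP 450
0xabc FXS 40 LD 5
0xabc ETH 10 Sushi_LP 5'''
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

# total_position_rewards: per-wallet sum broadcast back to every row
df['total_position_rewards'] = df.groupby('wallet')['position_rewards'].transform('sum')
# rewards_pct: each position's share of the wallet total
df['rewards_pct'] = df['position_rewards'] / df['total_position_rewards']
print(df[['wallet', 'position_rewards', 'total_position_rewards', 'rewards_pct']])
```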
What I want to calculate is total_token_transfers_per_position_type. That means grouping token_transfers per position_type, but since each wallet has only a single token_transfers value (repeated on every row), I cannot do df.groupby(['wallet', 'position_type'])['token_transfers'].sum(): that sums the repeated values within each wallet, whereas each wallet's value should be counted only once per position_type.
My code:
df.groupby(['wallet', 'position_type'])['token_transfers'].sum()
As stated above, I somehow need to stop the values from being summed within a wallet and only sum them across position_type.
For example, the above operation gives the following results for total_token_transfers_per_position_type:
LD: 450 + 450 + 5 = 905
Sushi_LP: 450 + 5 = 455
However, the correct result should be:
LD: 450 + 5 = 455
Sushi_LP: 450 + 5 = 455
Basically, I need to sum token_transfers across each position_type without summing the values within wallets. If I do df.groupby('wallet')['token_transfers'].sum() and one wallet has multiple position_types, I will get wrongly summed values.
If I do df.groupby('position_type')['token_transfers'].sum(), pandas will wrongly sum the values within each wallet.
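To make both failure modes concrete, here is a small reproduction using the sample data from the question:

```python
import io
import pandas as pd

text = '''wallet position position_rewards position_type token_transfers
0x123 FRAX 700 LD 450
0x123 USDC 200 LD 450
0x123 ETH 100 Sushi_LP 450
0xabc FXS 40 LD 5
0xabc ETH 10 Sushi_LP 5'''
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

# Grouping by wallet triple-counts 0x123's single value of 450
by_wallet = df.groupby('wallet')['token_transfers'].sum()
print(by_wallet)  # 0x123 -> 1350, 0xabc -> 10

# Grouping by position_type alone double-counts 450 within 0x123 for LD
by_type = df.groupby('position_type')['token_transfers'].sum()
print(by_type)  # LD -> 905, Sushi_LP -> 455
```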
CodePudding user response:
I don't fully understand your idea, but you can always use a for-loop to work with every group separately and then use set() to remove duplicates.
Something like this:
text = '''wallet position position_rewards position_type token_transfers total_position_rewards rewards_pct transfers_per_rewards
0x123 FRAX 700 LD 450 1000 0.7 315
0x123 USDC 200 LD 450 1000 0.2 90
0x123 ETH 100 Sushi_LP 450 1000 0.1 45
0xabc FXS 40 LD 5 50 0.8 4
0xabc ETH 10 Sushi_LP 5 50 0.2 1'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
#print(df)
# group by the column name (not a one-element list) so key is the plain label
for key, val in df.groupby('position_type'):
    print('key:', key)
    print('token_transfers:', val['token_transfers'].to_list())
    print('set:', set(val['token_transfers']))
    print('sum:', sum(set(val['token_transfers'])))
    print('---')
Result:
key: LD
token_transfers: [450, 450, 5]
set: {450, 5}
sum: 455
---
key: Sushi_LP
token_transfers: [450, 5]
set: {450, 5}
sum: 455
---
Maybe it could be coded using some .map or .agg on the groups, but with a for-loop you have simpler access to the groups and can more easily use standard Python functions.
EDIT:
Using apply() I can execute sum(set(x)):
print( df.groupby('position_type')['token_transfers'].apply(lambda x: sum(set(x))) )
or .apply(set).apply(sum) (because .apply(set).sum() doesn't work):
print( df.groupby('position_type')['token_transfers'].apply(set).apply(sum) )
Result:
position_type
LD 455
Sushi_LP 455
Name: token_transfers, dtype: int64
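A related alternative (not from the answer above, just a sketch): deduplicate per (wallet, position_type) before grouping. Unlike sum(set(...)), this still counts two different wallets that happen to share the same token_transfers value:

```python
import io
import pandas as pd

text = '''wallet position position_rewards position_type token_transfers
0x123 FRAX 700 LD 450
0x123 USDC 200 LD 450
0x123 ETH 100 Sushi_LP 450
0xabc FXS 40 LD 5
0xabc ETH 10 Sushi_LP 5'''
df = pd.read_csv(io.StringIO(text), sep=r'\s+')

# Keep one row per (wallet, position_type), then sum per position_type,
# so each wallet's value is counted exactly once per type
result = (df.drop_duplicates(subset=['wallet', 'position_type'])
            .groupby('position_type')['token_transfers']
            .sum())
print(result)  # LD -> 455, Sushi_LP -> 455
```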