Pandas dataframe IF in a SUM function


I have a data frame:

     asks_price  asks_qty exchange_name_ask     bids_price  bids_qty exchange_name_bid
0      20156.51  0.000745          Coinbase      20153.28  0.000200          Coinbase
1      20157.52  0.050000          Coinbase      20152.27  0.051000          Coinbase
2      20158.52  0.000745          Coinbase      20151.28  0.000200          Kraken
3      20158.52  0.050000          FTX           20151.28  0.051000          Coinbase

I would like to group rows with the same price, add the quantities together, and combine the exchange names, like:

     asks_price  asks_qty exchange_name_ask     bids_price  bids_qty exchange_name_bid
0    20156.51  0.000745          Coinbase      20153.28  0.000200          Coinbase
1    20157.52  0.050000          Coinbase      20152.27  0.051000          Coinbase
2    20158.52  0.050745          CoinbaseFTX   20151.28  0.051200       KrakenCoinbase

I succeeded using

df.groupby(['asks_price', 'bids_price']).sum(False)

It works, BUT if the name is the same I don't want to concatenate it. I don't want CoinbaseCoinbase or KrakenKraken, so I need to include an IF in the SUM function. How can I do it? Thanks
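For reference, a minimal reproduction of the approach with the data from the table above. Passing `numeric_only=False` makes `sum` concatenate the string columns, which is where the duplicated names would come from:

```python
import pandas as pd

# The data frame shown in the question
data = {'asks_price': {0: 20156.51, 1: 20157.52, 2: 20158.52, 3: 20158.52},
        'asks_qty': {0: 0.000745, 1: 0.05, 2: 0.000745, 3: 0.05},
        'exchange_name_ask': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 3: 'FTX'},
        'bids_price': {0: 20153.28, 1: 20152.27, 2: 20151.28, 3: 20151.28},
        'bids_qty': {0: 0.0002, 1: 0.051, 2: 0.0002, 3: 0.051},
        'exchange_name_bid': {0: 'Coinbase', 1: 'Coinbase', 2: 'Kraken', 3: 'Coinbase'}}
df = pd.DataFrame(data)

# numeric_only=False also "sums" (i.e. concatenates) the string columns
out = df.groupby(['asks_price', 'bids_price']).sum(numeric_only=False)
print(out)
```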

CodePudding user response:

Try as follows:

import pandas as pd

# Changing `2: 'Kraken'` to `2: 'Coinbase'` for `exchange_name_bid` to generate
# an example that doesn't require concatenation
data = {'asks_price': {0: 20156.51, 1: 20157.52, 2: 20158.52, 3: 20158.52}, 
        'asks_qty': {0: 0.000745, 1: 0.05, 2: 0.000745, 3: 0.05}, 
        'exchange_name_ask': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 
                              3: 'FTX'}, 
        'bids_price': {0: 20153.28, 1: 20152.27, 2: 20151.28, 3: 20151.28}, 
        'bids_qty': {0: 0.0002, 1: 0.051, 2: 0.0002, 3: 0.051}, 
        'exchange_name_bid': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 
                              3: 'Coinbase'}}
df = pd.DataFrame(data)

res = df.groupby(['asks_price', 'bids_price'], as_index=False).agg(
    {'asks_qty':'sum',
     'bids_qty':'sum',
     'exchange_name_ask': lambda x: ','.join(dict.fromkeys(x).keys()),
     'exchange_name_bid': lambda x: ','.join(dict.fromkeys(x).keys())
     })

print(res)

   asks_price  bids_price  asks_qty  bids_qty exchange_name_ask  \
0    20156.51    20153.28  0.000745    0.0002          Coinbase   
1    20157.52    20152.27  0.050000    0.0510          Coinbase   
2    20158.52    20151.28  0.050745    0.0512      Coinbase,FTX   

  exchange_name_bid  
0          Coinbase  
1          Coinbase  
2          Coinbase  # no concatenation here

Explanation

  • Use df.groupby(...).agg and assign different functions to the different columns: sum for the '*_qty' columns, and a lambda function for the 'exchange_name_*' columns that nests dict.fromkeys inside join, which drops duplicates while preserving order. (As mentioned by @jezrael in the comments, dict.fromkeys performs better than Series.drop_duplicates. Cf. his answer elsewhere.)
  • Change ','.join to ''.join if you don't want a delimiter. If you want to re-order the columns of res so that they match the order of the original df, use res = res.loc[:, df.columns].
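Putting both tips together, here is a sketch using the question's original data (with Kraken on row 2), ''.join instead of ','.join, and the original column order restored; it reproduces the desired output from the question:

```python
import pandas as pd

data = {'asks_price': {0: 20156.51, 1: 20157.52, 2: 20158.52, 3: 20158.52},
        'asks_qty': {0: 0.000745, 1: 0.05, 2: 0.000745, 3: 0.05},
        'exchange_name_ask': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 3: 'FTX'},
        'bids_price': {0: 20153.28, 1: 20152.27, 2: 20151.28, 3: 20151.28},
        'bids_qty': {0: 0.0002, 1: 0.051, 2: 0.0002, 3: 0.051},
        'exchange_name_bid': {0: 'Coinbase', 1: 'Coinbase', 2: 'Kraken', 3: 'Coinbase'}}
df = pd.DataFrame(data)

# dict.fromkeys removes duplicate names but keeps their order;
# ''.join glues the remaining names together without a delimiter
res = df.groupby(['asks_price', 'bids_price'], as_index=False).agg(
    {'asks_qty': 'sum',
     'bids_qty': 'sum',
     'exchange_name_ask': lambda x: ''.join(dict.fromkeys(x)),
     'exchange_name_bid': lambda x: ''.join(dict.fromkeys(x))})

# Restore the column order of the original df
res = res.loc[:, df.columns]
print(res)
```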