Pandas - Getting pair of columns and do calculation into new columns

Time:04-18

Suppose I have the following starting dataset, where the first column is a date and the second column onwards contains N pairs of associated Bid and Ask columns. The example below lists 3 Bid/Ask pairs, but there can be more.

Date    Bid_1   Ask_1   Bid_2   Ask_2   Bid_3   Ask_3   
may-05   2.00    2.15    2.06    2.23    2.12    2.30   
may-06   2.03    2.18    2.09    2.25    2.15    2.31   
may-07   2.06    2.21    2.12    2.39    2.19    2.46   
may-08   2.09    2.24    2.15    2.31    2.22    2.38   
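For reference, the sample table above can be reproduced as a DataFrame (this construction snippet is an addition for reproducibility, not part of the original question):

```python
import pandas as pd

# Sample data matching the table above: a Date column followed by
# N interleaved Bid_i / Ask_i column pairs.
df = pd.DataFrame({
    'Date':  ['may-05', 'may-06', 'may-07', 'may-08'],
    'Bid_1': [2.00, 2.03, 2.06, 2.09],
    'Ask_1': [2.15, 2.18, 2.21, 2.24],
    'Bid_2': [2.06, 2.09, 2.12, 2.15],
    'Ask_2': [2.23, 2.25, 2.39, 2.31],
    'Bid_3': [2.12, 2.15, 2.19, 2.22],
    'Ask_3': [2.30, 2.31, 2.46, 2.38],
})
```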

The desired output is a dataframe that, for each associated Bid/Ask pair, contains the "Bid Ask difference", as below:

Date    Bid_Ask_diff_1   Bid_Ask_diff_2   Bid_Ask_diff_3    
may-05       0.15            0.17            0.18   
may-06       0.15            0.16            0.16   
may-07       0.15            0.27            0.27   
may-08       0.15            0.16            0.16   

I am really struggling with this, as the solution should be able to handle a dynamic number of associated Bids and Asks. I would appreciate any guidance.

Thank you

CodePudding user response:

This is one solution; it is probably not the best one. You can separate the columns into two groups (skipping the Date column at position 0):

bid_df = df.iloc[:, 1::2]   # Bid_1, Bid_2, ... (odd positions after Date)
ask_df = df.iloc[:, 2::2]   # Ask_1, Ask_2, ...

# subtract the values of the two frames and store the result in a new frame
result_df = pd.DataFrame(ask_df.values - bid_df.values)

result_df.columns = [f'Bid_Ask_diff_{i + 1}' for i in range(result_df.shape[1])]
result_df.insert(0, 'Date', df['Date'])

or with numpy slicing (note that the columns have to be sliced, not the rows, and the Date column skipped):

res = pd.DataFrame(
    df.values[:, 2::2] - df.values[:, 1::2],
    columns=[f'Bid_Ask_diff_{i + 1}' for i in range((df.shape[1] - 1) // 2)],
)
res.insert(0, 'Date', df['Date'])

CodePudding user response:

Here is my solution to your problem:

df = (pd.wide_to_long(df,stubnames=['Bid','Ask'],i='Date',j='case',sep='_')
       .apply(lambda row: row['Ask'] - row['Bid'],axis=1)
       .reset_index(name='Bid_Ask_diff')
       .set_index(['Date', 'case'])['Bid_Ask_diff'].unstack().add_prefix('Bid_Ask_diff_')
       .reset_index()
      )
print(df)

case    Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0     may-05            0.15            0.17            0.18
1     may-06            0.15            0.16            0.16
2     may-07            0.15            0.27            0.27
3     may-08            0.15            0.16            0.16

Any feedback on my solution path is appreciated, since I haven't done this very often yet.
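One possible simplification of the chain above (my suggestion, not part of the original answer): the row-wise apply can be replaced with a vectorized column subtraction, which is typically faster on wide data:

```python
import pandas as pd

# Sample frame from the question (two Bid/Ask pairs for brevity).
df = pd.DataFrame({
    'Date': ['may-05', 'may-06'],
    'Bid_1': [2.00, 2.03], 'Ask_1': [2.15, 2.18],
    'Bid_2': [2.06, 2.09], 'Ask_2': [2.23, 2.25],
})

# Same wide_to_long reshape, but Ask - Bid is computed as one
# vectorized Series subtraction instead of a row-wise apply.
long = pd.wide_to_long(df, stubnames=['Bid', 'Ask'], i='Date', j='case', sep='_')
out = ((long['Ask'] - long['Bid'])
       .unstack('case')
       .add_prefix('Bid_Ask_diff_')
       .rename_axis(columns=None)
       .reset_index())
```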

Another option, via wide_to_long, agg and unstack:

import numpy as np

(
pd
.wide_to_long(df,stubnames=['Bid','Ask'],i='Date',j='case',sep='_')
.assign(diff = lambda df: df.iloc[:, ::-1]
                            .agg(np.subtract.reduce, axis = 1))
['diff']
.unstack()
.add_prefix('Bid_Ask_diff_')
.rename_axis(columns = None)
.reset_index()
)

     Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16

CodePudding user response:

One option that avoids flipping to long form (performance-wise, the fewer the rows, the better) is to group on the column axis (axis = 1) and subtract based on the numeric suffixes:

import numpy as np

temp = df.set_index('Date')
# get the numeric suffix of each column (1, 2, 3, ...)
grouper = temp.columns.str.split('_').str[-1]
(temp
.groupby(grouper, axis = 1)
.agg(np.subtract.reduce, axis = 1)
.mul(-1)
.add_prefix('Bid_Ask_diff_')
.reset_index()
)

   Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16
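Note that groupby with axis = 1 is deprecated in recent pandas releases; a sketch of an equivalent wide-form computation pairs the columns directly (relying on the `Bid_`/`Ask_` naming via `filter`, which is my assumption, not part of the original answer):

```python
import pandas as pd

# Sample frame from the question (two Bid/Ask pairs for brevity).
df = pd.DataFrame({
    'Date': ['may-05', 'may-06'],
    'Bid_1': [2.00, 2.03], 'Ask_1': [2.15, 2.18],
    'Bid_2': [2.06, 2.09], 'Ask_2': [2.23, 2.25],
})

temp = df.set_index('Date')
bids = temp.filter(like='Bid_')   # Bid_1, Bid_2, ...
asks = temp.filter(like='Ask_')   # Ask_1, Ask_2, ...

# Relabel the Ask columns with the Bid labels so .sub() aligns the pairs,
# then rename to the desired output columns.
diff = asks.set_axis(bids.columns, axis=1).sub(bids)
diff.columns = ['Bid_Ask_diff_' + c.split('_')[-1] for c in bids.columns]
result = diff.reset_index()
```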

If you do want to go to long form, a possibly efficient way is to group all the values into individual columns by number (values whose names end in 1 go to column 1, those ending in 2 to column 2, and so on), followed by a groupby and the aggregation (again this increases the number of rows, which the first option avoids). pivot_longer from pyjanitor offers an easy syntax to achieve this:

# pip install pyjanitor
import janitor
import numpy as np
import pandas as pd

(
df
.pivot_longer('Date', names_to = '.value', names_pattern = r'.+_(\d)')
.groupby('Date')
.agg(np.subtract.reduce)
.mul(-1)
.add_prefix('Bid_Ask_diff_')
.reset_index()
)

     Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16