Pandas - Getting pair of columns and do calculation into new columns

Time:04-18

Suppose I have the following starting dataset, where the first column is a date and the second column onwards contains N pairs of associated Bid and Ask columns. The example below lists 3 Bid/Ask pairs, but there can be more.

Date    Bid_1   Ask_1   Bid_2   Ask_2   Bid_3   Ask_3   
may-05   2.00    2.15    2.06    2.23    2.12    2.30   
may-06   2.03    2.18    2.09    2.25    2.15    2.31   
may-07   2.06    2.21    2.12    2.39    2.19    2.46   
may-08   2.09    2.24    2.15    2.31    2.22    2.38   
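For reference, the sample table above can be reproduced as a DataFrame (this construction snippet is an addition for reproducibility, not part of the original question):

```python
import pandas as pd

# Sample data matching the table above: a Date column followed by
# N interleaved Bid_i / Ask_i column pairs.
df = pd.DataFrame({
    'Date':  ['may-05', 'may-06', 'may-07', 'may-08'],
    'Bid_1': [2.00, 2.03, 2.06, 2.09],
    'Ask_1': [2.15, 2.18, 2.21, 2.24],
    'Bid_2': [2.06, 2.09, 2.12, 2.15],
    'Ask_2': [2.23, 2.25, 2.39, 2.31],
    'Bid_3': [2.12, 2.15, 2.19, 2.22],
    'Ask_3': [2.30, 2.31, 2.46, 2.38],
})
```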

The desired output is a dataframe that, for each associated Bid/Ask pair, contains the "Bid Ask difference", as below:

Date    Bid_Ask_diff_1   Bid_Ask_diff_2   Bid_Ask_diff_3    
may-05       0.15            0.17            0.18   
may-06       0.15            0.16            0.16   
may-07       0.15            0.27            0.27   
may-08       0.15            0.16            0.16   

I am really struggling with this, as the solution should be able to handle a dynamic number of associated Bids and Asks. I would appreciate any guidance.

Thank you

CodePudding user response:

This is one solution; it is probably not the best one. You can separate the columns into two groups (skipping the Date column at position 0):

bid_df = df.iloc[:, 1::2]   # Bid_1, Bid_2, ... (odd positions after Date)
ask_df = df.iloc[:, 2::2]   # Ask_1, Ask_2, ...

# subtract the values of the two frames and store the result in a new frame
result_df = pd.DataFrame(ask_df.values - bid_df.values)

result_df.columns = [f'Bid_Ask_diff_{i + 1}' for i in range(result_df.shape[1])]
result_df.insert(0, 'Date', df['Date'])

or with numpy slicing (note that the columns have to be sliced, not the rows, and the Date column skipped):

res = pd.DataFrame(
    df.values[:, 2::2] - df.values[:, 1::2],
    columns=[f'Bid_Ask_diff_{i + 1}' for i in range((df.shape[1] - 1) // 2)],
)
res.insert(0, 'Date', df['Date'])

CodePudding user response:

Here is my solution to your problem:

df = (pd.wide_to_long(df,stubnames=['Bid','Ask'],i='Date',j='case',sep='_')
       .apply(lambda row: row['Ask'] - row['Bid'],axis=1)
       .reset_index(name='Bid_Ask_diff')
       .set_index(['Date', 'case'])['Bid_Ask_diff'].unstack().add_prefix('Bid_Ask_diff_')
       .reset_index()
      )
print(df)

case    Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0     may-05            0.15            0.17            0.18
1     may-06            0.15            0.16            0.16
2     may-07            0.15            0.27            0.27
3     may-08            0.15            0.16            0.16

Any feedback on my solution path is appreciated, since I haven't done this very often yet.
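One possible simplification of the chain above (my suggestion, not part of the original answer): the row-wise apply can be replaced with a vectorized column subtraction, which is typically faster on wide data:

```python
import pandas as pd

# Sample frame from the question (two Bid/Ask pairs for brevity).
df = pd.DataFrame({
    'Date': ['may-05', 'may-06'],
    'Bid_1': [2.00, 2.03], 'Ask_1': [2.15, 2.18],
    'Bid_2': [2.06, 2.09], 'Ask_2': [2.23, 2.25],
})

# Same wide_to_long reshape, but Ask - Bid is computed as one
# vectorized Series subtraction instead of a row-wise apply.
long = pd.wide_to_long(df, stubnames=['Bid', 'Ask'], i='Date', j='case', sep='_')
out = ((long['Ask'] - long['Bid'])
       .unstack('case')
       .add_prefix('Bid_Ask_diff_')
       .rename_axis(columns=None)
       .reset_index())
```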

Another option, via wide_to_long, agg and unstack:

import numpy as np

(
pd
.wide_to_long(df,stubnames=['Bid','Ask'],i='Date',j='case',sep='_')
.assign(diff = lambda df: df.iloc[:, ::-1]
                            .agg(np.subtract.reduce, axis = 1))
['diff']
.unstack()
.add_prefix('Bid_Ask_diff_')
.rename_axis(columns = None)
.reset_index()
)

     Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16

CodePudding user response:

One option that avoids flipping to long form (performance-wise, the fewer the rows, the better) is to group on the column axis (axis = 1) and subtract based on the numeric suffixes:

import numpy as np

temp = df.set_index('Date')
# get the numeric suffix of each column (1, 2, 3, ...)
grouper = temp.columns.str.split('_').str[-1]
(temp
.groupby(grouper, axis = 1)
.agg(np.subtract.reduce, axis = 1)
.mul(-1)
.add_prefix('Bid_Ask_diff_')
.reset_index()
)

   Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16
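Note that groupby with axis = 1 is deprecated in recent pandas releases; a sketch of an equivalent wide-form computation pairs the columns directly (relying on the `Bid_`/`Ask_` naming via `filter`, which is my assumption, not part of the original answer):

```python
import pandas as pd

# Sample frame from the question (two Bid/Ask pairs for brevity).
df = pd.DataFrame({
    'Date': ['may-05', 'may-06'],
    'Bid_1': [2.00, 2.03], 'Ask_1': [2.15, 2.18],
    'Bid_2': [2.06, 2.09], 'Ask_2': [2.23, 2.25],
})

temp = df.set_index('Date')
bids = temp.filter(like='Bid_')   # Bid_1, Bid_2, ...
asks = temp.filter(like='Ask_')   # Ask_1, Ask_2, ...

# Relabel the Ask columns with the Bid labels so .sub() aligns the pairs,
# then rename to the desired output columns.
diff = asks.set_axis(bids.columns, axis=1).sub(bids)
diff.columns = ['Bid_Ask_diff_' + c.split('_')[-1] for c in bids.columns]
result = diff.reset_index()
```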

If you do want to go to long form, a possibly efficient way is to group all the values into individual columns by number (values whose names end in 1 go to column 1, those ending in 2 to column 2, and so on), followed by a groupby and the aggregation (again this increases the number of rows, which the first option avoids). pivot_longer from pyjanitor offers an easy syntax to achieve this:

# pip install pyjanitor
import janitor
import numpy as np
import pandas as pd

(
df
.pivot_longer('Date', names_to = '.value', names_pattern = r'.+_(\d)')
.groupby('Date')
.agg(np.subtract.reduce)
.mul(-1)
.add_prefix('Bid_Ask_diff_')
.reset_index()
)

     Date  Bid_Ask_diff_1  Bid_Ask_diff_2  Bid_Ask_diff_3
0  may-05            0.15            0.17            0.18
1  may-06            0.15            0.16            0.16
2  may-07            0.15            0.27            0.27
3  may-08            0.15            0.16            0.16