Suppose I have the following starting dataset, where the first column is a date and the remaining columns contain N associated Bid/Ask pairs. The example below lists 3 Bid/Ask pairs, but there can be more.
Date Bid_1 Ask_1 Bid_2 Ask_2 Bid_3 Ask_3
may-05 2.00 2.15 2.06 2.23 2.12 2.30
may-06 2.03 2.18 2.09 2.25 2.15 2.31
may-07 2.06 2.21 2.12 2.39 2.19 2.46
may-08 2.09 2.24 2.15 2.31 2.22 2.38
The desired output is a dataframe that, for each Bid/Ask pair, contains the "Bid Ask difference" (Ask minus Bid), like the below:
Date Bid_Ask_diff_1 Bid_Ask_diff_2 Bid_Ask_diff_3
may-05 0.15 0.17 0.18
may-06 0.15 0.16 0.16
may-07 0.15 0.27 0.27
may-08 0.15 0.16 0.16
I am really struggling with this because the solution needs to handle a dynamic number of Bid/Ask pairs. I would appreciate any guidance.
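For reference, the sample input above can be constructed like this (values taken directly from the table):

import pandas as pd

df = pd.DataFrame({
    'Date': ['may-05', 'may-06', 'may-07', 'may-08'],
    'Bid_1': [2.00, 2.03, 2.06, 2.09], 'Ask_1': [2.15, 2.18, 2.21, 2.24],
    'Bid_2': [2.06, 2.09, 2.12, 2.15], 'Ask_2': [2.23, 2.25, 2.39, 2.31],
    'Bid_3': [2.12, 2.15, 2.19, 2.22], 'Ask_3': [2.30, 2.31, 2.46, 2.38],
})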
Thank you
CodePudding user response:
This is one solution, probably not the best one. You can separate the columns into two groups, Bid columns and Ask columns:
import pandas as pd

# Bid columns sit at the odd positions, Ask columns at the even positions (column 0 is Date)
bid_df = df.iloc[:, [i for i in range(1, len(df.columns)) if i % 2 == 1]]
ask_df = df.iloc[:, [i for i in range(1, len(df.columns)) if i % 2 == 0]]
# subtract the underlying arrays and store the results in a new data frame
result_df = pd.DataFrame(ask_df.values - bid_df.values)
result_df.columns = [f'Bid_Ask_diff_{i + 1}' for i in range(result_df.shape[1])]
result_df.insert(0, 'Date', df['Date'])
Or with numpy slicing on the columns (Date is set as the index first, so only the numeric Bid/Ask columns are sliced):
vals = df.set_index('Date').to_numpy()
res = pd.DataFrame(vals[:, 1::2] - vals[:, ::2], columns=[f'Bid_Ask_diff_{i + 1}' for i in range(vals.shape[1] // 2)])
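As a quick sanity check against the expected values from the question (assuming the sample df shown there, and rounding to hide float noise):

print(res.round(2).values.tolist())
# [[0.15, 0.17, 0.18], [0.15, 0.16, 0.16], [0.15, 0.27, 0.27], [0.15, 0.16, 0.16]]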
CodePudding user response:
Here is my solution to your problem:
df = (pd.wide_to_long(df, stubnames=['Bid', 'Ask'], i='Date', j='case', sep='_')
        .apply(lambda row: row['Ask'] - row['Bid'], axis=1)
        .reset_index(name='Bid_Ask_diff')
        .set_index(['Date', 'case'])['Bid_Ask_diff']
        .unstack()
        .add_prefix('Bid_Ask_diff_')
        .reset_index()
     )
print(df)
case Date Bid_Ask_diff_1 Bid_Ask_diff_2 Bid_Ask_diff_3
0 may-05 0.15 0.17 0.18
1 may-06 0.15 0.16 0.16
2 may-07 0.15 0.27 0.27
3 may-08 0.15 0.16 0.16
Any feedback on my solution path is appreciated, since I haven't done this kind of reshaping very often yet.
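Starting again from the original wide df, a sketch of the same idea that replaces the row-wise apply with a vectorized column subtraction on the long frame (usually faster on larger data):

import pandas as pd

long_df = pd.wide_to_long(df, stubnames=['Bid', 'Ask'], i='Date', j='case', sep='_')
out = (long_df['Ask'] - long_df['Bid']).unstack('case').add_prefix('Bid_Ask_diff_').reset_index()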
Another option, via wide_to_long, agg and unstack:
import numpy as np
import pandas as pd

(
    pd.wide_to_long(df, stubnames=['Bid', 'Ask'], i='Date', j='case', sep='_')
      # reverse the column order to [Ask, Bid] so the reduction yields Ask - Bid
      .assign(diff=lambda d: d.iloc[:, ::-1].agg(np.subtract.reduce, axis=1))
      ['diff']
      .unstack()
      .add_prefix('Bid_Ask_diff_')
      .rename_axis(columns=None)
      .reset_index()
)
Date Bid_Ask_diff_1 Bid_Ask_diff_2 Bid_Ask_diff_3
0 may-05 0.15 0.17 0.18
1 may-06 0.15 0.16 0.16
2 may-07 0.15 0.27 0.27
3 may-08 0.15 0.16 0.16
CodePudding user response:
One option that avoids flipping to long form (performance-wise, fewer rows is better) is to group on the column axis (axis=1) by the numeric suffix and do the subtraction within each group:
import numpy as np

temp = df.set_index('Date')
# group the columns by the numeric suffix at the end of each name ('1', '2', ...)
grouper = temp.columns.str.split('_').str[-1]
(temp
   .groupby(grouper, axis=1)
   .agg(np.subtract.reduce, axis=1)   # Bid - Ask within each group
   .mul(-1)                           # flip the sign to get Ask - Bid
   .add_prefix('Bid_Ask_diff_')
   .reset_index()
)
Date Bid_Ask_diff_1 Bid_Ask_diff_2 Bid_Ask_diff_3
0 may-05 0.15 0.17 0.18
1 may-06 0.15 0.16 0.16
2 may-07 0.15 0.27 0.27
3 may-08 0.15 0.16 0.16
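A small aside: newer pandas versions deprecate groupby(axis=1), so if you see a warning, an equivalent sketch transposes, groups by the same suffix, and transposes back:

import numpy as np

(temp.T
     .groupby(grouper)
     .agg(np.subtract.reduce)   # Bid - Ask per date within each suffix group
     .T
     .mul(-1)                   # flip the sign to get Ask - Bid
     .add_prefix('Bid_Ask_diff_')
     .reset_index()
)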
If you do want to go to long form, a possibly efficient way is to group the columns by their numeric suffix into individual columns (those ending in 1 go to column 1, those ending in 2 go to column 2, and so on), followed by a groupby and the aggregation (again, the number of rows increases, which the first option avoids). pivot_longer from pyjanitor offers an easy syntax for this:
# pip install pyjanitor
import janitor
import numpy as np
import pandas as pd

(
    df
    .pivot_longer('Date', names_to='.value', names_pattern=r'.+(\d)')
    .groupby('Date')
    .agg(np.subtract.reduce)   # Bid - Ask within each Date group
    .mul(-1)                   # flip the sign to get Ask - Bid
    .add_prefix('Bid_Ask_diff_')
    .reset_index()
)
Date Bid_Ask_diff_1 Bid_Ask_diff_2 Bid_Ask_diff_3
0 may-05 0.15 0.17 0.18
1 may-06 0.15 0.16 0.16
2 may-07 0.15 0.27 0.27
3 may-08 0.15 0.16 0.16