I would like to improve the speed of the following code. The data set is a list of trades that I would like to stresstest, by simulating various parameters and have all the results stored in a table.
The way I'm performing this, it's by designing the range of parameters and then I iterate over their values, I initiate a copy of the dataset, I assign the value of the parameters to new columns and I concat everything in a huge dataframe.
I would like to know if someone has a good idea to avoid the three for loops to build the dataframe ?
'''
# Defining the range of parameters to simulate
volchange = range(-1,2)
spreadchange = range(-10,11)
flatchange = range(-10,11)
# the df where I store all the results
final_result = pd.DataFrame()
# Iterating over the range of parameters
for vol in volchange:
for spread in spreadchange:
for flat in flatchange:
# Creating a copy of the initial dataset, assigning the simulated values to three
# new columns and concat it with the rest, resulting in a dataframe which is
# several time the initial dataset with all the possible triplet of parameters
inter_pos = pos.copy()
inter_pos['vol_change[pts]'] = vol
inter_pos['spread_change[%]'] = spread
inter_pos['spot_change[%]'] = flat
final_result = pd.concat([final_result,inter_pos], axis = 0)
# Performing computation at dataframe level
final_result['sim_vol'] = final_result['vol_change[pts]'] final_result['ImpliedVolatility']
final_result['spread'].multiply(final_result['spread_change[%]'])/100
final_result['sim_spread'] = final_result['spread'] final_result['spread_change']
final_result['spot_change'] = final_result['spot'] * final_result['spot_change[%]']/100
final_result['sim_spot'] = final_result['spot'] final_result['spot_change']
final_result['sim_price'] = final_result['sim_spot'] - final_result['sim_spread']
'''
Thanks a lot for your help !
Have a nice week ahead !
CodePudding user response:
Concatenating pandas dataframes onto one another takes a long time. It's better to create a list of dataframes and then use pd.concat
to concatenate them all at once.
You can test this yourself like this:
import pandas as pd
import numpy as np
from time import time
dfs = []
columns = [f"{i:02d}" for i in range(100)]
time_start = time()
for i in range(100):
data = np.random.random((10000, 100))
df = pd.DataFrame(columns=columns, data=data)
dfs.append(df)
new_df = pd.concat(dfs)
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 1.851675271987915
new_df = pd.DataFrame(columns=columns)
time_start = time()
for i in range(100):
data = np.random.random((10000, 100))
df = pd.DataFrame(columns=columns, data=data)
new_df = pd.concat([new_df, df])
time_end = time()
print(f"Time elapsed: {time_end-time_start}")
# Time elapsed: 12.258363008499146
You can also use itertools.product
to get rid of your nested for loops.
Also as suggested by @Ahmed AEK:
you can pass
data=itertools.product(volchange, spreadchange ,flatchange )
topd.DataFrame
directly, and avoid creating the list altogether, which is a more memory efficient and faster approach