I have a large Dataframe based on market data from the online game EVE. I'm trying to determine the most profitable trades based on the price of the buy or sell order of an item. I've found that it takes quite a while to loop through all the possibilities and would like some advice on how to make my code more efficient.
data = https://market.fuzzwork.co.uk/orderbooks/latest.csv.gz
SETUP:
import pandas as pd
df = pd.read_csv('latest.csv', sep='\t', names=["orderID","typeID","issued","buy","volume","volumeEntered","minVolume","price","stationID","range","duration","region","orderSet"])
Iterate through all the possibilites
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
for j in sell_order.index:
if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
This takes quite a long time (33s on a ryzen 2600x, 12s on an M1 Pro)
Shorten the iteration
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
for i in buy_order.index:
if buy_order.loc[i, 'price'] > sell_order.price.min():
for j in sell_order.index:
if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
else:
break
else:
break
This shaves about 25%-30% off the time (23s on 2600x, 9s on the M1 Pro)
Times have been recorded in a Jupyter Notebook
Any Tips are welcome!
CodePudding user response:
Option 1 - Iterate through all the possibilites (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
profitable_trade = []
for i in buy_order.index:
for j in sell_order.index:
if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 33.145344734191895 seconds
Option 2 - Shorten the iteration (yours):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade2 = []
for i in buy_order.index:
if buy_order.loc[i, 'price'] > sell_order.price.min():
for j in sell_order.index:
if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
else:
break
else:
break
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 26.736826419830322 seconds
Option 3 - Pandas Optimizations:
You can get some speedup by applying the following optimizations:
- iterate over dataframe items directly (iterrows instead of index loc)
- single filtering operation for sell-orders
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
for _, sell in filtered_sell_orders.iterrows():
profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 19.43745183944702 seconds
Note that almost all the time is spent on the tolist()
-operations (the following option is just for showing this impact, it does not return the target list):
start = time.time()
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
for _, sell in filtered_sell_orders.iterrows():
# removed 'tolist'-operations
profitable_trade.append(1)
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 2.072049617767334 seconds
Option 4 - Replace tolist-operations and store results in dataframe:
You can accelerate your code by
- storing your filtered values in intermediate lists containing rows of the original dataframe
- converting the intermediate lists to dataframes and concatenating them
- the resulting dataframe yields the same information as the list
profitable_trade
- convert the dataframe to the desired list of lists (if needed)
start = time.time()
buy_orders = df[(df.typeID == 34) & (df.buy == True)]
sell_orders = df[(df.typeID == 34) & (df.buy == False)]
# store buy and cell rows in intermediate lists
buys = []
sells = []
for _, buy in buy_orders.iterrows():
# apply filtering operation once
filtered_sell_orders = sell_orders[sell_orders.price < buy.price]
sell_rows = list(filtered_sell_orders.iterrows())
# store buy and sell row items
buys.extend([buy] * len(sell_rows))
sells.extend([sell for _, sell in sell_rows])
# convert intermediate lists to dataframes
buys = pd.DataFrame(buys)
sells = pd.DataFrame(sells)
# rename columns for buys / cells dataframes for unique column names
buys = buys.rename(columns={column: f"{column}_buy" for column in buys.columns})
sells = sells.rename(columns={column: f"{column}_sell" for column in sells.columns})
# reset indices and concatenate buys / cells along the column axis
buys.reset_index(drop=True, inplace=True)
sells.reset_index(drop=True, inplace=True)
profitable_trade_df = pd.concat([buys, sells], axis=1)
# convert to list of lists (if needed)
profitable_trade = profitable_trade_df[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy', 'stationID_buy', 'range_buy','orderID_sell', 'price_sell', 'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()
stop = time.time()
print(f"Time: {stop - start} seconds")
Time: 3.785726308822632 seconds
CodePudding user response:
Many thanks to @daniel.fehrenbacher for the explanation and suggestions.
In addition to his options, I've found a few myself using this article: https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee#:
TL;DR
- Don't use
tolist()
- Filter operation isn't always better, depends on the iteration method
- There are much faster iteration methods than a regular for loop, or even
iterrows()
: use dictionary iteration
Use of .tolist()
is detrimental
As mention in the answer above, a .tolist()
uses too much time. It's much faster to use append([item1, item2, item3...])
than use append(row[['item1', 'item2', item3'...]].tolist())
tolist()
: 19.2s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
for _, sell in filtered_sell_orders.iterrows():
profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist() sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
append([item1, item2])
: 3.5s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, buy in buy_order.iterrows():
filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
for _, sell in filtered_sell_orders.iterrows():
profitable_trade.append([
buy.typeID,
buy.orderID,
buy.price,
buy.volume,
buy.stationID,
buy.range,
sell.orderID,
sell.price,
sell.volume,
sell.stationID,
sell.range
])
Filtering Operation VS break
While the single filtering operation has a slight efficiency increase when you use .iterrows()
, I've found it is the opposite when you use the better .itertuples()
.
iterrows()
with filter operation: 3.26s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for _, row_buy in buy_order.iterrows():
filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
for _, row_sell in filtered_sell_orders.iterrows():
profitable_trade.append([
row_buy.typeID,
row_buy.orderID,
row_buy.price,
row_buy.volume,
row_buy.stationID,
row_buy.range,
row_sell.orderID,
row_sell.price,
row_sell.volume,
row_sell.stationID,
row_sell.range
])
iterrows()
withbreak
statements: 3.77s
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for _, row_buy in buy_order.iterrows():
if row_buy.price > lowest_sell:
for _, row_sell in sell_order.iterrows():
if row_buy.price > row_sell.price:
profitable_trade3.append([
row_buy.typeID,
row_buy.orderID,
row_buy.price,
row_buy.volume,
row_buy.stationID,
row_buy.range,
row_sell.orderID,
row_sell.price,
row_sell.volume,
row_sell.stationID,
row_sell.range
])
else:
break
else:
break
itertuples
with filter operation: 650ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
profitable_trade = []
for row_buy in buy_order.itertuples():
filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
for row_sell in filtered_sell_orders.itertuples():
profitable_trade.append([
row_buy.typeID,
row_buy.orderID,
row_buy.price,
row_buy.volume,
row_buy.stationID,
row_buy.range,
row_sell.orderID,
row_sell.price,
row_sell.volume,
row_sell.stationID,
row_sell.range
])
itertuples
withbreak
statement: 375ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade3 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.itertuples():
if row_buy.price > lowest_sell:
for row_sell in sell_order.itertuples():
if row_buy.price > row_sell.price:
profitable_trade3.append([
row_buy.typeID,
row_buy.orderID,
row_buy.price,
row_buy.volume,
row_buy.stationID,
row_buy.range,
row_sell.orderID,
row_sell.price,
row_sell.volume,
row_sell.stationID,
row_sell.range
])
else:
break
else:
break
Better iteration methods
itertuples
(see above): 375ms- Numpy Iteration Method (
df.values
): 200ms
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade4 = []
lowest_sell = sell_order.price.min()
for row_buy in buy_order.values:
if row_buy[7] > lowest_sell:
for row_sell in sell_order.values:
if row_buy[7] > row_sell[7]:
profitable_trade4.append([
row_buy[1],
row_buy[0],
row_buy[7],
row_buy[4],
row_buy[8],
row_buy[9],
row_sell[0],
row_sell[7],
row_sell[4],
row_sell[8],
row_sell[9]
])
else:
break
else:
break
- Dictionary Iteration (
df.to_dict('records')
): 78ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
profitable_trade5 = []
buy_dict = buy_order.to_dict('records')
sell_dict = sell_order.to_dict('records')
lowest_sell = sell_order.price.min()
for row_buy in buy_dict:
if row_buy['price'] > lowest_sell:
for row_sell in sell_dict:
if row_buy['price'] > row_sell['price']:
profitable_trade5.append([
row_buy['typeID'],
row_buy['orderID'],
row_buy['price'],
row_buy['volume'],
row_buy['stationID'],
row_buy['range'],
row_sell['orderID'],
row_sell['price'],
row_sell['volume'],
row_sell['stationID'],
row_sell['range']
])
else:
break
else:
break