Home > Software design >  Efficient Pandas Row Iteration for comparison
Efficient Pandas Row Iteration for comparison

Time:07-23

I have a large Dataframe based on market data from the online game EVE. I'm trying to determine the most profitable trades based on the price of the buy or sell order of an item. I've found that it takes quite a while to loop through all the possibilities and would like some advice on how to make my code more efficient.

data = https://market.fuzzwork.co.uk/orderbooks/latest.csv.gz

SETUP:

import pandas as pd
df = pd.read_csv('latest.csv', sep='\t', names=["orderID","typeID","issued","buy","volume","volumeEntered","minVolume","price","stationID","range","duration","region","orderSet"])

Iterate through all the possibilites

buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

profitable_trade = []

for i in buy_order.index:
    for j in sell_order.index:

        if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())

This takes quite a long time (33s on a ryzen 2600x, 12s on an M1 Pro)

Shorten the iteration

buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)

for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:

            if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
                profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break

This shaves about 25%-30% off the time (23s on 2600x, 9s on the M1 Pro)

Times have been recorded in a Jupyter Notebook

Any Tips are welcome!

CodePudding user response:

Option 1 - Iterate through all the possibilites (yours):

start = time.time()
    
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
    
profitable_trade = []
    
for i in buy_order.index:
    for j in sell_order.index:
    
        if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
            profitable_trade.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
    
stop = time.time()
print(f"Time: {stop - start} seconds")

Time: 33.145344734191895 seconds

Option 2 - Shorten the iteration (yours):

start = time.time()
    
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()
    
buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)
    
profitable_trade2 = []
    
for i in buy_order.index:
    if buy_order.loc[i, 'price'] > sell_order.price.min():
        for j in sell_order.index:
    
            if buy_order.loc[i,'price'] > sell_order.loc[j, 'price']:
                    profitable_trade2.append(buy_order.loc[i, ['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell_order.loc[j, ['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
            else:
                break
    else:
        break
    
stop = time.time()
print(f"Time: {stop - start} seconds")

Time: 26.736826419830322 seconds

Option 3 - Pandas Optimizations:

You can get some speedup by applying the following optimizations:

  • iterate over dataframe items directly (iterrows instead of index loc)
  • single filtering operation for sell-orders
start = time.time()
    
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
    
profitable_trade = []
    
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())
    
stop = time.time()
print(f"Time: {stop - start} seconds")

Time: 19.43745183944702 seconds

Note that almost all the time is spent on the tolist()-operations (the following option is just for showing this impact, it does not return the target list):

start = time.time()

buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
    
profitable_trade = []
    
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        # removed 'tolist'-operations
        profitable_trade.append(1)
    
stop = time.time()
print(f"Time: {stop - start} seconds")

Time: 2.072049617767334 seconds

Option 4 - Replace tolist-operations and store results in dataframe:

You can accelerate your code by

  • storing your filtered values in intermediate lists containing rows of the original dataframe
  • converting the intermediate lists to dataframes and concatenating them
  • the resulting dataframe yields the same information as the list profitable_trade
  • convert the dataframe to the desired list of lists (if needed)
start = time.time()

buy_orders = df[(df.typeID == 34) & (df.buy == True)]
sell_orders = df[(df.typeID == 34) & (df.buy == False)]

# store buy and cell rows in intermediate lists
buys = []
sells = []

for _, buy in buy_orders.iterrows():
    # apply filtering operation once
    filtered_sell_orders = sell_orders[sell_orders.price < buy.price]
    sell_rows = list(filtered_sell_orders.iterrows())

    # store buy and sell row items
    buys.extend([buy] * len(sell_rows))
    sells.extend([sell for _, sell in sell_rows])

# convert intermediate lists to dataframes
buys = pd.DataFrame(buys)
sells = pd.DataFrame(sells)

# rename columns for buys / cells dataframes for unique column names
buys = buys.rename(columns={column: f"{column}_buy"  for column in buys.columns})
sells = sells.rename(columns={column: f"{column}_sell"  for column in sells.columns})

# reset indices and concatenate buys / cells along the column axis
buys.reset_index(drop=True, inplace=True)
sells.reset_index(drop=True, inplace=True)
profitable_trade_df = pd.concat([buys, sells], axis=1)

# convert to list of lists (if needed)
profitable_trade = profitable_trade_df[['typeID_buy', 'orderID_buy', 'price_buy', 'volume_buy', 'stationID_buy', 'range_buy','orderID_sell', 'price_sell', 'volume_sell', 'stationID_sell', 'range_sell']].values.tolist()

stop = time.time()
print(f"Time: {stop - start} seconds")

Time: 3.785726308822632 seconds

CodePudding user response:

Many thanks to @daniel.fehrenbacher for the explanation and suggestions.

In addition to his options, I've found a few myself using this article: https://towardsdatascience.com/heres-the-most-efficient-way-to-iterate-through-your-pandas-dataframe-4dad88ac92ee#:

TL;DR

  • Don't use tolist()
  • Filter operation isn't always better, depends on the iteration method
  • There are much faster iteration methods than a regular for loop, or even iterrows(): use dictionary iteration

Use of .tolist() is detrimental

As mention in the answer above, a .tolist() uses too much time. It's much faster to use append([item1, item2, item3...]) than use append(row[['item1', 'item2', item3'...]].tolist())

  • tolist(): 19.2s
%%time
    
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
    
profitable_trade = []
    
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append(buy[['typeID', 'orderID', 'price', 'volume', 'stationID', 'range']].tolist()   sell[['orderID', 'price', 'volume', 'stationID', 'range']].tolist())

  • append([item1, item2]): 3.5s
%%time
    
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]
    
profitable_trade = []
    
for _, buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < buy["price"]]
    for _, sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
            buy.typeID,
            buy.orderID,
            buy.price,
            buy.volume,
            buy.stationID,
            buy.range,
            sell.orderID,
            sell.price,
            sell.volume,
            sell.stationID,
            sell.range
            ])

Filtering Operation VS break

While the single filtering operation has a slight efficiency increase when you use .iterrows(), I've found it is the opposite when you use the better .itertuples().

  • iterrows() with filter operation: 3.26s
%%time

buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]

profitable_trade = []
    
for _, row_buy in buy_order.iterrows():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for _, row_sell in filtered_sell_orders.iterrows():
        profitable_trade.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
  • iterrows() with break statements: 3.77s
%%time

buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)

profitable_trade3 = []
lowest_sell = sell_order.price.min()

for _, row_buy in buy_order.iterrows():
    if row_buy.price > lowest_sell:
        for _, row_sell in sell_order.iterrows():

            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break
  • itertuples with filter operation: 650ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)]
sell_order = df[(df.typeID == 34) & (df.buy == False)]

profitable_trade = []
    
for row_buy in buy_order.itertuples():
    filtered_sell_orders = sell_order[sell_order["price"] < row_buy.price]
    for row_sell in filtered_sell_orders.itertuples():
        profitable_trade.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
  • itertuples with break statement: 375ms
%%time
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)

profitable_trade3 = []
lowest_sell = sell_order.price.min()

for row_buy in buy_order.itertuples():
    if row_buy.price > lowest_sell:
        for row_sell in sell_order.itertuples():

            if row_buy.price > row_sell.price:
                profitable_trade3.append([
                    row_buy.typeID,
                    row_buy.orderID,
                    row_buy.price,
                    row_buy.volume,
                    row_buy.stationID,
                    row_buy.range,
                    row_sell.orderID,
                    row_sell.price,
                    row_sell.volume,
                    row_sell.stationID,
                    row_sell.range
                ])
            else:
                break
    else:
        break

Better iteration methods

  • itertuples (see above): 375ms
  • Numpy Iteration Method (df.values): 200ms
buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)

profitable_trade4 = []
lowest_sell = sell_order.price.min()

for row_buy in buy_order.values:
    if row_buy[7] > lowest_sell:
        for row_sell in sell_order.values:

            if row_buy[7] > row_sell[7]:
                profitable_trade4.append([
                    row_buy[1],
                    row_buy[0],
                    row_buy[7],
                    row_buy[4],
                    row_buy[8],
                    row_buy[9],
                    row_sell[0],
                    row_sell[7],
                    row_sell[4],
                    row_sell[8],
                    row_sell[9]
                ])
            else:
                break
    else:
        break
  • Dictionary Iteration (df.to_dict('records')): 78ms
%%time

buy_order = df[(df.typeID == 34) & (df.buy == True)].copy()
sell_order = df[(df.typeID == 34) & (df.buy == False)].copy()

buy_order.sort_values(by='price', ascending=False, inplace=True, ignore_index=True)
sell_order.sort_values(by='price', ascending=True, inplace=True, ignore_index=True)

profitable_trade5 = []
buy_dict = buy_order.to_dict('records')
sell_dict = sell_order.to_dict('records')
lowest_sell = sell_order.price.min()

for row_buy in buy_dict:
    if row_buy['price'] > lowest_sell:
        for row_sell in sell_dict:

            if row_buy['price'] > row_sell['price']:
                profitable_trade5.append([
                    row_buy['typeID'],
                    row_buy['orderID'],
                    row_buy['price'],
                    row_buy['volume'],
                    row_buy['stationID'],
                    row_buy['range'],
                    row_sell['orderID'],
                    row_sell['price'],
                    row_sell['volume'],
                    row_sell['stationID'],
                    row_sell['range']
                ])
            else:
                break
    else:
        break

  • Related