I am having issues with my code taking too long to run, and am unsure about how to make it more efficient/ run faster.
I have a data frame of about 3.5 million observations and have this code as my criteria within the for loop. If I drop the trades = trades.drop([i], axis=0) section then it takes seconds and doesn't drop the rows, however, with this line of code it hasn't even stopped running yet.
Ideally I would allocate all of the rows where the for loop is successful into another data frame as I know what to do from there, but I am unsure about how to do that which is why the .drop is there.
Any help would be much appreciated.
# Drop the rows/ trades that are greater than the ASK_PLUS_SPREAD
counter = 0
for i in range(0, len(trades['ASK_PLUS_SPREAD'])):
if trades['PRICE'][i] > trades['ASK_PLUS_SPREAD'][i]:
counter = counter 1
counts_q['F5'] = counter
trades = trades.drop([i], axis=0)
else:
pass
# Drop the rows/ trades that are lower than the BID_MINUS_SPREAD and add the amount taken to the counter
for i in range(0, len(trades['BID_MINUS_SPREAD'])):
if trades['PRICE'][i] < trades['BID_MINUS_SPREAD'][i]:
counter = counter 1
counts_q['F5'] = counter
trades = trades.drop([i], axis = 0)
else:
pass
CodePudding user response:
This will give you the trades df with only rows where PRICE
is greater than ASK_PLUS_SPREAD
.
trades = trades[trades['PRICE'] > trades['ASK_PLUS_SPREAD]]
If you want some sort of comparison of how many rows were removed, I'd look into comparing trades.shape
before and after the change.
CodePudding user response:
What you did in the line
trades = trades.drop([i], axis=0)
is to create a new DataFrame by dropping one row from the old DataFrame. That' s why it is slow. You can parallelize your operation by using Boolean indexing. Consider
trades['PRICE'] > trades['ASK_PLUS_SPREAD']
this takes the column PRICE
and column ASK_PLUS_SPREAD
as two series and compare, which the result is a new series, in Boolean.
Now you can do two things with it: Firstly,
count = (trades['PRICE'] > trades['ASK_PLUS_SPREAD']).sum()
which counts how many PRICE
greater than ASK_PLUS_SPREAD
(since True is regarded as 1 and False is 0 in this case). Secondly,
trades = trades[trades['PRICE'] > trades['ASK_PLUS_SPREAD']]
which selects only the rows that fits the condition. In fact, if you don't need to count which is less and which is greater, you can replace the entire code in your example with the following:
condition = (trades['PRICE'] > trades['ASK_PLUS_SPREAD']) | (trades['PRICE'] < trades['BID_MINUS_SPREAD'])
counter = condition.sum()
trades = trades[condition]
which |
is a bitwise OR. You need to use bitwise operation to combine boolean series.