I have a 20 GB trades.csv file with two columns (trade_time and price) and 650 million rows.
Sample Data
https://gist.github.com/dsstex/bc885ed04a6de98afc7102ed08b78608
Pandas DataFrame
df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
I would like to check whether the price goes up or down by a given percentage. If the price hits the up_value (e.g. 1%) first, the result is 1. If the price hits the down_value (e.g. -0.5%) first, the result is 0. I need to do this for all 650 million rows.
At the moment the dataframe has only two columns: trade_time (index) and price. I would like to add a new column named "result".
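A tiny worked example of the rule, with made-up prices: starting from a row priced at 100.00 with up_value = 1% and down_value = -0.5%, the thresholds are 101.00 and 99.50, and whichever is crossed first while scanning forward decides the result.

```python
# hypothetical later trades, in time order, following a row priced at 100.00
prices = [100.40, 99.70, 101.10]

up_threshold = 100.00 * (1 + 0.01)      # +1%   -> 101.00
down_threshold = 100.00 * (1 - 0.005)   # -0.5% ->  99.50

result = None
for p in prices:
    if p > up_threshold:                # up threshold hit first
        result = 1
        break
    if p < down_threshold:              # down threshold hit first
        result = 0
        break

print(result)  # 101.10 crosses 101.00 before anything crosses 99.50 -> 1
```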
import pandas as pd

df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
df["result"] = None
print(df)

up_percentage = 0.2
down_percentage = 0.1

def calc_value_from_percentage(percentage, whole):
    return (percentage / 100) * whole

def set_result(index):
    up_value = 0
    down_value = 0
    for _, current_row_price, _ in df.loc[index:].itertuples():
        # compute the thresholds once, from the price at the starting row
        if up_value == 0 or down_value == 0:
            up_delta = calc_value_from_percentage(up_percentage, current_row_price)
            down_delta = calc_value_from_percentage(down_percentage, current_row_price)
            up_value = current_row_price + up_delta
            down_value = current_row_price - down_delta
        if current_row_price > up_value:
            df.loc[index, "result"] = 1
            return
        if current_row_price < down_value:
            df.loc[index, "result"] = 0
            return

for ind, _, _ in df.itertuples():
    set_result(ind)

df.to_csv("results.csv", index=True, header=True)
print(df)
Results
https://gist.github.com/dsstex/fe3759beedbf9c46ace382a7eef3d12c
Note: Due to insufficient data, most of the bottom rows in the above file have the value "None" for "result", so those cells are blank.
At the moment, I'm using pandas itertuples() to process the file. I would like a vectorized solution, since the file is huge.
Note: Last month I asked this question. This is a follow-up question, and it is related to this answer. In that answer, the author uses a fixed-size up_value/down_value of 200. But I'm after a percentage-based vectorized solution.
Any help is greatly appreciated.
Thanks
CodePudding user response:
Reading your full code I finally understood your algorithm.
For every index of the dataframe, you have to compute whether the "result" is 1 or 0, so that:
- 1 means that I find in the dataframe another price, current_row_price in the loop, that is greater than my original price for the current index (computed in the if block) by an up_delta value;
- 0 means that I find another price in df that is lower than my original price by a down_delta value.
I came up with this code. Maybe the loop is avoidable, but this should be faster.
from enum import Enum

import pandas as pd

class Result(int, Enum):
    DOWN = 0
    UP = 1
    MISSING = 2

df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
df["result"] = Result.MISSING

# constants
up_percentage = 0.2
down_percentage = 0.1

# compute upper and lower bound for every row
df["upper_bound"] = df["price"] * (1 + up_percentage / 100)
df["lower_bound"] = df["price"] * (1 - down_percentage / 100)

# for each row get current upper and lower bounds, and check
# in the rest of the dataframe if any row is greater/lower than these values
for i, row in df.iterrows():
    up_result: bool = (df["price"].loc[i:] > row["upper_bound"]).any()
    down_result: bool = (df["price"].loc[i:] < row["lower_bound"]).any()
    # priority to up_result, as in your original code
    if up_result:
        df.loc[i, "result"] = Result.UP
    elif down_result:
        df.loc[i, "result"] = Result.DOWN
    else:
        df.loc[i, "result"] = Result.MISSING

# remove utility columns
df.drop(columns=["upper_bound", "lower_bound"], inplace=True)

# store result
df.to_csv("results.csv", index=True, header=True)
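If the row-by-row loop turns out to be a bottleneck, the two .any() scans can be vectorized: "some later price is above the bound" is equivalent to "the running maximum of the remaining prices is above the bound", and a reverse cummax/cummin gives exactly those running extrema. A minimal sketch on made-up prices (the column names follow the question, the toy data is my own, and 2 plays the role of MISSING):

```python
import numpy as np
import pandas as pd

up_percentage = 0.2
down_percentage = 0.1

# toy data standing in for trades.csv (assumption: same shape as the real frame)
df = pd.DataFrame({"price": [100.0, 100.05, 100.3, 99.8, 99.85]})

upper = df["price"] * (1 + up_percentage / 100)
lower = df["price"] * (1 - down_percentage / 100)

# running max/min over the remaining rows: reverse, accumulate, reverse back
future_max = df["price"][::-1].cummax()[::-1]
future_min = df["price"][::-1].cummin()[::-1]

up_hit = future_max > upper      # some later price exceeds the upper bound
down_hit = future_min < lower    # some later price falls below the lower bound

# same priority as the loop: up wins over down, else MISSING (2)
df["result"] = np.select([up_hit, down_hit], [1, 0], default=2)

print(df["result"].tolist())  # [1, 1, 0, 2, 2]
```

This reproduces the answer's loop semantics (up takes priority whenever both bounds are crossed somewhere later, not necessarily first) in a handful of array passes, which should scale far better over 650 million rows.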