Home > Software design >  Numpy - Vectorized calculation of a large csv file
Numpy - Vectorized calculation of a large csv file

Time:06-18

I have a 20 GB trades.csv file. It has two columns (trade_time and price). And the csv file contains 650 million rows.

Sample Data

https://gist.github.com/dsstex/bc885ed04a6de98afc7102ed08b78608

Pandas DataFrame

df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)

I would like to check whether price is up or down based on percentage. If the price hits up_value (e.g. 1%) first, the result is 1. If the price hits down_value (e.g. -0.5%) first, then the result is 0. I need to do this for all 650 million rows.

At the moment, the dataframe has only two columns. trade_time(index), price. I would like to have a new column named "result".

import pandas as pd

df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
df["result"] = None

print(df)

up_percentage = 0.2
down_percentage = 0.1


def calc_value_from_percentage(percentage, whole):
    return (percentage / 100) * whole


def set_result(index):

    up_value = 0
    down_value = 0

    for _, current_row_price, _ in df.loc[index:].itertuples():
        if up_value == 0 or down_value == 0:

            up_delta = calc_value_from_percentage(up_percentage, current_row_price)
            down_delta = calc_value_from_percentage(down_percentage, current_row_price)

            up_value = current_row_price   up_delta
            down_value = current_row_price - down_delta

        if current_row_price > up_value:
            df.loc[index, "result"] = 1
            return

        if current_row_price < down_value:
            df.loc[index, "result"] = 0
            return


for ind, _, _ in df.itertuples():
    set_result(ind)

df.to_csv("results.csv", index=True, header=True)
print(df)

Results

https://gist.github.com/dsstex/fe3759beedbf9c46ace382a7eef3d12c

Note: Due to insufficient data, most of the bottom rows in the above file has the value "None" for "result". So the value is blank.


At the moment, I'm using pandas itertuples() to process the file. I would like to have a vectorized solution since I have a huge file.

Note: Last month I asked this question. This is a follow-up question. And it is related to this answer. In that answer, the author is using a fixed size up_value/down_value of 200. But I'm after a percentage based vectorized solution.

Any help is greatly appreciated.

Thanks

CodePudding user response:

Reading your full code I finally understood your algorithm.

For every index of the dataframe, you have to compute if the "result" is 1 or 0, so that:

  • 1 means that I find in the dataframe another price, current_row_price in the loop, that is greater than my original price - for current index, computed in the if block - by a up_delta value;
  • 0 means that I find another price in the df that is lower than my original price by a down_delta value.

I came up with this code. Maybe the loop is avoidable, but this should be faster.

from enum import Enum

import pandas as pd


class Result(int, Enum):
    DOWN = 0
    UP = 1
    MISSING = 2


df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
df["result"] = Result.MISSING

# constants 
up_percentage = 0.2
down_percentage = 0.1

# compute upper and lower bound for every row
df["upper_bound"] = df["price"] * (1   up_percentage / 100)
df["lower_bound"] = df["price"] * (1 - down_percentage / 100)

# for each row get current upper and lower bounds, and check 
# in all dataframe if any row is greater/lower than these values
for i, row in df.iterrows():
    up_result: bool = (df["price"].loc[i:] > row["upper_bound"]).any()
    down_result: bool = (df["price"].loc[i:] < row["lower_bound"]).any()

    # priority over up_result, as in your original code
    if up_result:
        df.loc[i, "result"] = Result.UP
    elif down_result:
        df.loc[i, "result"] = Result.DOWN
    else:
        df.loc[i, "result"] = Result.MISSING


# remove utility columns
df.drop(columns=["upper_bound", "lower_bound"], inplace=True)

# store result
df.to_csv("results.csv", index=True, header=True)
  • Related