I'm trying to measure the time elapsed since the beginning of an event. In this case, I want to know when the volume of bitcoin traded per minute has exceeded a certain threshold, since volume is what moves the price. So I want to measure how long there has been significant volume and record this measurement in a new column.
Here is an example of a dataframe that contains the date as the index and the traded volume. I added a column that indicates when the volume has exceeded a certain threshold:
import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': ['132', '109', '74', '57', '123', '21'],
    'Volume_cat': ['big_volume', 'big_volume', None, None, 'big_volume', None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
df
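which displays as:
                    Volume  Volume_cat
Time
2022-01-11 09:30:00    132  big_volume
2022-01-11 09:31:00    109  big_volume
2022-01-11 09:32:00     74        None
2022-01-11 09:33:00     57        None
2022-01-11 09:34:00    123  big_volume
2022-01-11 09:35:00     21        None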
My goal is to have a new column that will display the elapsed time (in seconds) since the last detection of the 'big_volume' event and will reset itself at each new detection. Here is a line that can be added to the example code:
df['delta_big_vol'] = [60, 120, 180, 240, 60, 120]
df
I have to use the apply() method, but have not found any lambda that would work. In pseudo code it would look like:
from datetime import timedelta
# pseudo code, not valid Python:
df['delta_xl_vol'] = df.apply(lambda row: timedelta.total_seconds() if row["Volume"] > 100 else ...)
Thanks for your help.
CodePudding user response:
For this process, we can't have null values in our "Volume_cat" column:
>>> df["Volume_cat"] = df["Volume_cat"].fillna("-") # This could be any string except "big_volume"
Now, let's assign a group to each set of consecutive values in the "Volume_cat" column (consecutive "big_volume" are grouped, and consecutive "-" too).
>>> df["Group"] = ((df.Volume_cat != df.Volume_cat.shift()).cumsum())
Then, rank each row within its group. The important part is the rank of the rows that had null values. We also override the rows with "big_volume", because the time since the last observation should be zero there:
>>> df["rank"] = df.groupby("Group")["Volume_cat"].rank("first", ascending=False)
>>> df.loc[(df["Volume_cat"] == "big_volume"), "rank"] = 0.0 # Maybe you prefer 1.0?
Finally, we can multiply the "rank" column by 60 to get the number of seconds since the last row with a "big_volume" observation. Since this approach adds several helper columns, you may want to do it on a copy of your dataframe and then copy only the "delta_big_vol" column back into the original:
>>> df["delta_big_vol"] = df["rank"] * 60
>>> df
Volume Volume_cat Group rank delta_big_vol
Time
2022-01-11 09:30:00 132 big_volume 1 0.0 0.0
2022-01-11 09:31:00 109 big_volume 1 0.0 0.0
2022-01-11 09:32:00 74 - 2 1.0 60.0
2022-01-11 09:33:00 57 - 2 2.0 120.0
2022-01-11 09:34:00 123 big_volume 3 0.0 0.0
2022-01-11 09:35:00 21 - 4 1.0 60.0
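Note that multiplying the rank by 60 assumes the rows are exactly one minute apart. If your bars can be irregular, here is a minimal sketch (assuming the datetime index from the example) that computes the elapsed seconds directly from the index:
>>> # Timestamp of the most recent "big_volume" row, carried forward
>>> last_big = df.index.to_series().where(df["Volume_cat"] == "big_volume").ffill()
>>> # Elapsed seconds since that timestamp (NaN before the first event)
>>> df["delta_big_vol"] = (df.index.to_series() - last_big).dt.total_seconds()
For the example data this gives the same values as the rank-based approach.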
CodePudding user response:
You are writing that "I have to use the apply()
method". Is that true - because it wouldn't be a good idea?
Under the assumption that the Volume
column contains numerical data (yours contains str
data), you could do
threshold = 100

# Start a new group at every row whose volume exceeds the threshold,
# then accumulate 60 seconds per row within each group.
df['Result'] = (
    df.assign(Result=60).Result
    .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 60
2022-01-11 09:31:00 109 big_volume 60
2022-01-11 09:32:00 74 None 120
2022-01-11 09:33:00 57 None 180
2022-01-11 09:34:00 123 big_volume 60
2022-01-11 09:35:00 21 None 120
Or, if you prefer to start at 0, you could do
df['Result'] = (
    # Rows above the threshold contribute 0 seconds, the others 60,
    # so the count restarts at 0 on every big-volume row.
    df.assign(Result=(df.Volume <= threshold) * 60).Result
    .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 0
2022-01-11 09:31:00 109 big_volume 0
2022-01-11 09:32:00 74 None 60
2022-01-11 09:33:00 57 None 120
2022-01-11 09:34:00 123 big_volume 0
2022-01-11 09:35:00 21 None 60
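Since the Volume column in your example actually holds strings, you could convert it first; pd.to_numeric is the usual pandas helper for this (a one-line sketch):
# Convert the string volumes from the example to numbers
# so the comparisons against the threshold work.
df["Volume"] = pd.to_numeric(df["Volume"])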