I'm trying to measure the time elapsed since the beginning of an event. In this case, I want to know when the volume of bitcoin traded per minute has exceeded a certain threshold, since volume is what moves the price. So I want to measure how long there has been significant volume and record this measurement in a new column.
Here is an example of a dataframe that contains the date as the index and the traded volume. I added a column that indicates when the volume has exceeded a certain threshold:
import pandas as pd

df = pd.DataFrame({
    'Time': ['2022-01-11 09:30:00', '2022-01-11 09:31:00', '2022-01-11 09:32:00',
             '2022-01-11 09:33:00', '2022-01-11 09:34:00', '2022-01-11 09:35:00'],
    'Volume': ['132', '109', '74', '57', '123', '21'],
    'Volume_cat': ['big_volume', 'big_volume', None, None, 'big_volume', None],
})
df['Time'] = pd.to_datetime(df['Time'])
df.set_index('Time', inplace=True)
df
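which displays as:
                    Volume  Volume_cat
Time
2022-01-11 09:30:00    132  big_volume
2022-01-11 09:31:00    109  big_volume
2022-01-11 09:32:00     74        None
2022-01-11 09:33:00     57        None
2022-01-11 09:34:00    123  big_volume
2022-01-11 09:35:00     21        None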
My goal is to have a new column that will display the elapsed time (in seconds) since the last detection of the 'big_volume' event and will reset itself at each new detection. Here is a line that can be added to the example code:
df['delta_big_vol'] = [60, 120, 180, 240, 60, 120]
df
I have to use the apply() method, but have not found any lambda that would work. In pseudo code it would look like:
from datetime import timedelta
# pseudo code, not valid Python:
df['delta_xl_vol'] = df.apply(lambda row: timedelta.total_seconds() if row["Volume"] > 100 else ...)
Thanks for your help.
CodePudding user response:
For this process, we can't have null values in our "Volume_cat" column:
>>> df["Volume_cat"] = df["Volume_cat"].fillna("-") # This could be any string except "big_volume"
Now, let's assign a group to each set of consecutive values in the "Volume_cat" column (consecutive "big_volume" are grouped, and consecutive "-" too).
>>> df["Group"] = ((df.Volume_cat != df.Volume_cat.shift()).cumsum())
Then, rank each row within its group. The important part is the rank of the rows that had null values. We also override the rows with "big_volume", because the time since the last observation should be zero there:
>>> df["rank"] = df.groupby("Group")["Volume_cat"].rank("first", ascending=False)
>>> df.loc[(df["Volume_cat"] == "big_volume"), "rank"] = 0.0 # Maybe you prefer 1.0?
Finally, we can multiply the "rank" column by 60 to get the number of seconds since the last row with a "big_volume" observation. Since this approach adds several helper columns, you may want to do it on a copy of your dataframe and then copy only the "delta_big_vol" column back into the original:
>>> df["delta_big_vol"] = df["rank"] * 60
>>> df
Volume Volume_cat Group rank delta_big_vol
Time
2022-01-11 09:30:00 132 big_volume 1 0.0 0.0
2022-01-11 09:31:00 109 big_volume 1 0.0 0.0
2022-01-11 09:32:00 74 - 2 1.0 60.0
2022-01-11 09:33:00 57 - 2 2.0 120.0
2022-01-11 09:34:00 123 big_volume 3 0.0 0.0
2022-01-11 09:35:00 21 - 4 1.0 60.0
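Note that multiplying the rank by 60 assumes the rows are exactly one minute apart. If your bars can be irregular, here is a minimal sketch (assuming the datetime index from the example) that computes the elapsed seconds directly from the index:
>>> # Timestamp of the most recent "big_volume" row, carried forward
>>> last_big = df.index.to_series().where(df["Volume_cat"] == "big_volume").ffill()
>>> # Elapsed seconds since that timestamp (NaN before the first event)
>>> df["delta_big_vol"] = (df.index.to_series() - last_big).dt.total_seconds()
For the example data this gives the same values as the rank-based approach.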
CodePudding user response:
You are writing that "I have to use the apply()
method". Is that true - because it wouldn't be a good idea?
Under the assumption that the Volume
column contains numerical data (yours contains str
data), you could do
threshold = 100

# Start a new group at every row whose volume exceeds the threshold,
# then accumulate 60 seconds per row within each group.
df['Result'] = (
    df.assign(Result=60).Result
    .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 60
2022-01-11 09:31:00 109 big_volume 60
2022-01-11 09:32:00 74 None 120
2022-01-11 09:33:00 57 None 180
2022-01-11 09:34:00 123 big_volume 60
2022-01-11 09:35:00 21 None 120
Or, if you prefer to start at 0, you could do
df['Result'] = (
    # Rows above the threshold contribute 0 seconds, the others 60,
    # so the count restarts at 0 on every big-volume row.
    df.assign(Result=(df.Volume <= threshold) * 60).Result
    .groupby((df.Volume > threshold).cumsum()).cumsum()
)
with the result
Volume Volume_cat Result
Time
2022-01-11 09:30:00 132 big_volume 0
2022-01-11 09:31:00 109 big_volume 0
2022-01-11 09:32:00 74 None 60
2022-01-11 09:33:00 57 None 120
2022-01-11 09:34:00 123 big_volume 0
2022-01-11 09:35:00 21 None 60
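Since the Volume column in your example actually holds strings, you could convert it first; pd.to_numeric is the usual pandas helper for this (a one-line sketch):
# Convert the string volumes from the example to numbers
# so the comparisons against the threshold work.
df["Volume"] = pd.to_numeric(df["Volume"])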