I have dataset of ticker information with timestamps spanning over a month, but I need to grab only the times at 4:00PM. The closest tick to this is 16:00:03, which is what I currently use. I hard coded the vales for the month of August by manually inputting the dates, but would like to change this so I can specify what month to use instead of inputting every day, or setting a start and end date.
df = df.loc[((df["timestamp"] == "2021-08-02 16:00:03")) |
((df["timestamp"] == "2021-08-03 16:00:03")) |
((df["timestamp"] == "2021-08-04 16:00:03")) |
((df["timestamp"] == "2021-08-05 16:00:03")) |
((df["timestamp"] == "2021-08-06 16:00:03")) ]
timestamp bidprice askprice
0 2021-08-02 14:59:03 99.937500 99.949219
1 2021-08-02 15:00:03 99.941406 99.945312
2 2021-08-02 15:01:03 99.941406 99.945312
3 2021-08-02 15:02:03 99.941406 99.945312
4 2021-08-02 15:03:03 99.941406 99.945312
...
timestamp bidprice askprice
468109 2021-09-01 22:55:02 110.500000 110.546875
468110 2021-09-01 22:56:02 110.500000 110.546875
468111 2021-09-01 22:57:02 110.500000 110.546875
468112 2021-09-01 22:58:02 110.484375 110.531250
468113 2021-09-01 22:59:02 110.484375 110.531250
CodePudding user response:
First, you want to convert the DateTime strings to timestamps
df['timestamp'] = pd.to_datetime(df['timestamp'])
Then isolate
df = df.loc[df['timestamp'] == datetime.time(hour=16, minute=3)]
Sorry for my untested code but this should at least put you on the right track.
CodePudding user response:
Use an asof
merge with a DataFrame that is a single Series with daily times at 16:00:00. You can specify the direction to be 'nearest'
, 'forward'
or 'backward'
to get the matching logic you'd like.
Sample Data
import numpy as np
import pandas as pd
N = 30000
np.random.seed(123)
df1 = pd.DataFrame({'timestamp': (pd.date_range('2021-08-01', freq='29s', periods=N)
pd.to_timedelta(np.random.normal(0,1,N), unit='ms')),
'value': range(N)})
Code
#Daily 16:00:00 DataFrame
start_date = '2021-08-01 16:00:00'
end_date = '2021-08-11 16:00:00'
dfbase = pd.DataFrame({'date': pd.date_range(start_date, end_date, freq='D')})
result = pd.merge_asof(dfbase, df1.sort_values('timestamp'),
left_on='date', right_on='timestamp',
direction='nearest', allow_exact_matches=True)
print(result)
date timestamp value
0 2021-08-01 16:00:00 2021-08-01 15:59:53.999784590 1986
1 2021-08-02 16:00:00 2021-08-02 16:00:14.000160424 4966
2 2021-08-03 16:00:00 2021-08-03 16:00:05.000322262 7945
3 2021-08-04 16:00:00 2021-08-04 15:59:55.998303052 10924
4 2021-08-05 16:00:00 2021-08-05 15:59:46.998877694 13903
5 2021-08-06 16:00:00 2021-08-06 16:00:06.998954204 16883
6 2021-08-07 16:00:00 2021-08-07 15:59:58.000602203 19862
7 2021-08-08 16:00:00 2021-08-08 15:59:49.001400290 22841
8 2021-08-09 16:00:00 2021-08-09 16:00:08.998636467 25821
9 2021-08-10 16:00:00 2021-08-10 15:59:59.998385577 28800
10 2021-08-11 16:00:00 2021-08-11 01:39:31.001659917 29999
Another option, that will give you a similar result as above but less information is to use DataFrame.asof
after setting the index. You can supply the Series of daily dates.
df1 = df1.set_index('timestamp')
df1.asof(dfbase.date)
value
date
2021-08-01 16:00:00 1986.0
2021-08-02 16:00:00 4965.0
2021-08-03 16:00:00 7944.0
2021-08-04 16:00:00 10924.0
2021-08-05 16:00:00 13903.0
2021-08-06 16:00:00 16882.0
2021-08-07 16:00:00 19862.0
2021-08-08 16:00:00 22841.0
2021-08-09 16:00:00 25820.0
2021-08-10 16:00:00 28800.0
2021-08-11 16:00:00 29999.0
So similar result (some rows are different because this enforces the 'backward'
direction for the match), but doesn't give you any information about which exact timestamp was matched, and also doesn't support setting a tolerance
to exclude bad matches (possibly like the last row which is still the closest, but is a bad match)