Is there pandas functionality to grab specific timestamps from a datetime column on a daily basis?-CodePudding

I have dataset of ticker information with timestamps spanning over a month, but I need to grab only the times at 4:00PM. The closest tick to this is 16:00:03, which is what I currently use. I hard coded the vales for the month of August by manually inputting the dates, but would like to change this so I can specify what month to use instead of inputting every day, or setting a start and end date.

df = df.loc[((df["timestamp"] == "2021-08-02 16:00:03")) | 
           ((df["timestamp"] == "2021-08-03 16:00:03")) |
           ((df["timestamp"] == "2021-08-04 16:00:03")) |
           ((df["timestamp"] == "2021-08-05 16:00:03")) |
           ((df["timestamp"] == "2021-08-06 16:00:03")) ]

    timestamp           bidprice    askprice
0   2021-08-02 14:59:03 99.937500   99.949219
1   2021-08-02 15:00:03 99.941406   99.945312
2   2021-08-02 15:01:03 99.941406   99.945312
3   2021-08-02 15:02:03 99.941406   99.945312
4   2021-08-02 15:03:03 99.941406   99.945312
...
        timestamp           bidprice    askprice
468109  2021-09-01 22:55:02 110.500000  110.546875
468110  2021-09-01 22:56:02 110.500000  110.546875
468111  2021-09-01 22:57:02 110.500000  110.546875
468112  2021-09-01 22:58:02 110.484375  110.531250
468113  2021-09-01 22:59:02 110.484375  110.531250

CodePudding user response：

First, you want to convert the DateTime strings to timestamps

df['timestamp'] = pd.to_datetime(df['timestamp'])

Then isolate

df = df.loc[df['timestamp'] == datetime.time(hour=16, minute=3)]

Sorry for my untested code but this should at least put you on the right track.

CodePudding user response：

Use an asof merge with a DataFrame that is a single Series with daily times at 16:00:00. You can specify the direction to be 'nearest', 'forward' or 'backward' to get the matching logic you'd like.

Sample Data

import numpy as np
import pandas as pd

N = 30000
np.random.seed(123)
df1 = pd.DataFrame({'timestamp': (pd.date_range('2021-08-01', freq='29s', periods=N)
                                    pd.to_timedelta(np.random.normal(0,1,N), unit='ms')),
                    'value': range(N)})

Code

#Daily 16:00:00 DataFrame
start_date = '2021-08-01 16:00:00'
end_date = '2021-08-11 16:00:00'
dfbase = pd.DataFrame({'date': pd.date_range(start_date, end_date, freq='D')})

result = pd.merge_asof(dfbase, df1.sort_values('timestamp'), 
                       left_on='date', right_on='timestamp',
                       direction='nearest', allow_exact_matches=True)

print(result)

                  date                     timestamp  value
0  2021-08-01 16:00:00 2021-08-01 15:59:53.999784590   1986
1  2021-08-02 16:00:00 2021-08-02 16:00:14.000160424   4966
2  2021-08-03 16:00:00 2021-08-03 16:00:05.000322262   7945
3  2021-08-04 16:00:00 2021-08-04 15:59:55.998303052  10924
4  2021-08-05 16:00:00 2021-08-05 15:59:46.998877694  13903
5  2021-08-06 16:00:00 2021-08-06 16:00:06.998954204  16883
6  2021-08-07 16:00:00 2021-08-07 15:59:58.000602203  19862
7  2021-08-08 16:00:00 2021-08-08 15:59:49.001400290  22841
8  2021-08-09 16:00:00 2021-08-09 16:00:08.998636467  25821
9  2021-08-10 16:00:00 2021-08-10 15:59:59.998385577  28800
10 2021-08-11 16:00:00 2021-08-11 01:39:31.001659917  29999

Another option, that will give you a similar result as above but less information is to use DataFrame.asof after setting the index. You can supply the Series of daily dates.

df1 = df1.set_index('timestamp')
df1.asof(dfbase.date)

                       value
date                        
2021-08-01 16:00:00   1986.0
2021-08-02 16:00:00   4965.0
2021-08-03 16:00:00   7944.0
2021-08-04 16:00:00  10924.0
2021-08-05 16:00:00  13903.0
2021-08-06 16:00:00  16882.0
2021-08-07 16:00:00  19862.0
2021-08-08 16:00:00  22841.0
2021-08-09 16:00:00  25820.0
2021-08-10 16:00:00  28800.0
2021-08-11 16:00:00  29999.0

So similar result (some rows are different because this enforces the 'backward' direction for the match), but doesn't give you any information about which exact timestamp was matched, and also doesn't support setting a tolerance to exclude bad matches (possibly like the last row which is still the closest, but is a bad match)