I'm wondering if there is a simple way to filter time with a date-time POSIXct variable.
I discovered non-equal filtering with time variable (hms
) is straightforward:
> apple_data
# A tibble: 10 × 6
SYMBOL DATE TIME BB BO date_time
<chr> <date> <time> <dbl> <dbl> <dttm>
1 AAPL 2009-01-02 09:30:00 85.6 85.6 2009-01-02 09:30:00
2 AAPL 2009-01-02 09:30:01 85.6 85.9 2009-01-02 09:30:01
3 AAPL 2009-01-02 09:30:02 85.6 85.7 2009-01-02 09:30:02
4 AAPL 2009-01-02 09:30:03 85.6 85.7 2009-01-02 09:30:03
5 AAPL 2009-01-02 09:30:04 85.6 85.8 2009-01-02 09:30:04
6 AAPL 2009-01-02 09:30:05 85.6 85.7 2009-01-02 09:30:05
7 AAPL 2009-01-02 09:30:06 85.6 85.7 2009-01-02 09:30:06
8 AAPL 2009-01-02 09:30:07 85.6 85.7 2009-01-02 09:30:07
9 AAPL 2009-01-02 09:30:08 85.6 85.7 2009-01-02 09:30:08
10 AAPL 2009-01-02 09:30:09 85.6 85.7 2009-01-02 09:30:09
apple_data %>% filter(TIME <= as_hms("09:30:05"), TIME >= as_hms("09:30:03"))
# A tibble: 3 × 6
SYMBOL DATE TIME BB BO date_time
<chr> <date> <time> <dbl> <dbl> <dttm>
1 AAPL 2009-01-02 09:30:03 85.6 85.7 2009-01-02 09:30:03
2 AAPL 2009-01-02 09:30:04 85.6 85.8 2009-01-02 09:30:04
3 AAPL 2009-01-02 09:30:05 85.6 85.7 2009-01-02 09:30:05
Question 1
If I do not have DATE
and TIME
variables but date_time
only instead, which is POSIXct
, how could I perform non-equi filtering only with time?
Question 2
I tried extracting TIME
from date_time
using format(date_time, "%T")
, and discovered time filtering can be done even though the output is a string. However, it takes too much time to convert string
to hms
on big data, and I need it for merging with other data.
Is there a fast way to convert string
to hms
, or extract hms
from date_time
from the beginning so that I can skip this costly type conversion? Any suggestions are greatly appreciated.
Reprex
structure(list(SYMBOL = structure(c("AAPL", "AAPL", "AAPL", "AAPL",
"AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL"), label = "Stock Symbol"),
DATE = structure(c(14246, 14246, 14246, 14246, 14246, 14246,
14246, 14246, 14246, 14246), label = "Quote date", format.sas = "YYMMDDN8", class = "Date"),
TIME = structure(c(34200, 34201, 34202, 34203, 34204, 34205,
34206, 34207, 34208, 34209), class = c("hms", "difftime"), units = "secs"),
BB = structure(c(85.55, 85.6, 85.56, 85.55, 85.57, 85.56,
85.61, 85.61, 85.62, 85.62), label = "Best Bid"), BO = structure(c(85.6,
85.86, 85.66, 85.66, 85.8, 85.66, 85.66, 85.66, 85.73, 85.73
), label = "Best Offer"), date_time = structure(c(1230888600,
1230888601, 1230888602, 1230888603, 1230888604, 1230888605,
1230888606, 1230888607, 1230888608, 1230888609), tzone = "UTC", format.sas = "DATETIME20", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
CodePudding user response:
1) Calculate the range of seconds, rng
, in the comparison times and also in the time of date_time
. That avoids character processing for date_time
. Note that 86400 equals 24 * 60 * 60.
library(dplyr, exclude = c("filter", "lag"))
rng <- as.difftime(c("09:30:03", "09:30:05"), unit = "secs")
apple_data %>%
dplyr::filter(between(as.numeric(date_time) %% 86400, !!!rng))
giving:
# A tibble: 3 × 6
SYMBOL DATE TIME BB BO date_time
<chr> <date> <hms> <dbl> <dbl> <dttm>
1 AAPL 2009-01-02 34203 secs 85.6 85.7 2009-01-02 09:30:03
2 AAPL 2009-01-02 34204 secs 85.6 85.8 2009-01-02 09:30:04
3 AAPL 2009-01-02 34205 secs 85.6 85.7 2009-01-02 09:30:05
2) A base R version of the above is nearly the same.
Between <- function(x, ..., rng = range(c(...))) x >= rng[1] & x <= rng[2]
rng <- as.difftime(c("09:30:03", "09:30:05"), unit = "secs")
apple_data |>
subset(Between(as.numeric(date_time) %% 86400, rng))