I have imported data from excel that looks like this:
Time_int <- c('01FEB2022_00:00:00.000000', '00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00',
'02FEB2022_00:00:00.000000', '00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00',
'03FEB2022_00:00:00.000000', '00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00')
Values <- c(NA, 2682, 2689, 2690, 2686, NA, 2717, 2754, 2745, 2734, NA, 2735, 2746, 2753, 2701)
df <- cbind(Time_int, Values)
Time_int Values
01FEB2022_00:00:00.000000 <NA>
00:00 - 06:00 2682
06:00 - 12:00 2689
12:00 - 18:00 2690
18:00 - 00:00 2686
02FEB2022_00:00:00.000000 <NA>
00:00 - 06:00 2717
06:00 - 12:00 2754
12:00 - 18:00 2745
18:00 - 00:00 2734
03FEB2022_00:00:00.000000 <NA>
00:00 - 06:00 2735
06:00 - 12:00 2746
12:00 - 18:00 2753
18:00 - 00:00 2701
The issue with this is that the date may interrupt my time-series calculations. So I want to be able to have the date in its own column and the time intervals in a separate column so that the values are not disrupted by the date. I would like to produce it so that it looks more like this:
Date <- c('2022/02/01', '2022/02/01', '2022/02/01', '2022/02/01',
'2022/02/02', '2022/02/02', '2022/02/02', '2022/02/02',
'2022/02/03', '2022/02/03', '2022/02/03', '2022/02/03')
Int <- c('00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00',
'00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00',
'00:00 - 06:00', '06:00 - 12:00', '12:00 - 18:00', '18:00 - 00:00')
V <- c(2682, 2689, 2690, 2686, 2717, 2754, 2745, 2734, 2735, 2746, 2753, 2701)
df2 <- cbind(Date, Int, V)
Date Int V
2022-02-01 00:00 - 06:00 2682
2022-02-01 06:00 - 12:00 2689
2022-02-01 12:00 - 18:00 2690
2022-02-01 18:00 - 00:00 2686
2022-02-02 00:00 - 06:00 2717
2022-02-02 06:00 - 12:00 2754
2022-02-02 12:00 - 18:00 2745
2022-02-02 18:00 - 00:00 2734
2022-02-03 00:00 - 06:00 2735
2022-02-03 06:00 - 12:00 2746
2022-02-03 12:00 - 18:00 2753
2022-02-03 18:00 - 00:00 2701
I am also open to any other possible solutions since I am not entirely sure if this is the correct table format to use when testing intraday data.
CodePudding user response:
Here is a tidyverse
approach:
library(dplyr); library(tidyr)
df |>
data.frame() |>
mutate(
Date = as.Date(
gsub("_. ", "", Time_int), format = "%d%B%Y"
)
) |>
fill(Date) |>
drop_na(Values) |>
rename(
V = Values,
Int = Time_int
) |>
relocate(
Int, Date
)
# Int Date V
# 1 00:00 - 06:00 2022-02-01 2682
# 2 06:00 - 12:00 2022-02-01 2689
# 3 12:00 - 18:00 2022-02-01 2690
# 4 18:00 - 00:00 2022-02-01 2686
# 5 00:00 - 06:00 2022-02-02 2717
# 6 06:00 - 12:00 2022-02-02 2754
# 7 12:00 - 18:00 2022-02-02 2745
# 8 18:00 - 00:00 2022-02-02 2734
# 9 00:00 - 06:00 2022-02-03 2735
# 10 06:00 - 12:00 2022-02-03 2746
# 11 12:00 - 18:00 2022-02-03 2753
# 12 18:00 - 00:00 2022-02-03 2701
CodePudding user response:
Here is how we could do it:
- transform matrix class to tibble
- create Date column (duplicate of Time_int)
- create group with specific feature
00:00:00.000000
andcumsum
- extract date character and transform to date class with lubridates
dmy
function - fill and drop.na
library(tidyverse)
library(lubridate)
df %>%
as_tibble() %>%
mutate(Date = Time_int) %>%
group_by(group_id = cumsum(str_detect(Date, '00:00:00.000000'))) %>%
mutate(Date = dmy(str_extract(Date, '\\d{2}\\w \\d{4}'))) %>%
fill(Date, .direction = "down") %>%
drop_na() %>%
ungroup() %>%
select(Date, Int = Time_int, V = Values)
Date Int V
<date> <chr> <chr>
1 2022-02-01 00:00 - 06:00 2682
2 2022-02-01 06:00 - 12:00 2689
3 2022-02-01 12:00 - 18:00 2690
4 2022-02-01 18:00 - 00:00 2686
5 2022-02-02 00:00 - 06:00 2717
6 2022-02-02 06:00 - 12:00 2754
7 2022-02-02 12:00 - 18:00 2745
8 2022-02-02 18:00 - 00:00 2734
9 2022-02-03 00:00 - 06:00 2735
10 2022-02-03 06:00 - 12:00 2746
11 2022-02-03 12:00 - 18:00 2753
12 2022-02-03 18:00 - 00:00 2701