I have a dataset that looks like this:
Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0
and I wanted to fill in values for say 10-31 and 11-1 to contain the values from the previous trading day (10-30). How is this easily accomplished in R? I feel as though library(tidyr) and complete fit into this picture somehow?
Expected representation would be:
Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0
Requested dput output
structure(list(Date = c("2020-10-28", "2020-10-29", "2020-10-30",
"2020-11-02", "2020-11-03", "2020-11-04", "2020-11-05", "2020-11-06",
"2020-11-09", "2020-11-10"), Open = c(1384L, 1297L, 1283L, 1284L,
1263L, 1224L, 1194L, 1196L, 1207L, 1200L), High = c(1384L, 1297L,
1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Low = c(1384L,
1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
), Close = c(1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L,
1196L, 1207L, 1200L), Adjusted_close = c(1384L, 1297L, 1283L,
1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Volume = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 10L), class = "data.frame")
CodePudding user response:
first Date must be in date format
df$Date = as.Date(df$Date)
df %>%
full_join(data.frame(Date = seq(min(df$Date), max(df$Date), by = "days")),by = "Date") %>%
arrange(Date) %>%
fill(everything())
then make a join with a data that only contains the entire sequence of dates from the database, we sort it and use the fill function to fill them
CodePudding user response:
Solution
Here's a solution in the tidyverse
, which leverages the tidyr::fill()
function to fill in values from earlier rows:
library(tidyverse)
# ...
# Code to generate 'my_data'.
# ...
my_data %>%
# Ensure 'Date' column is proper datatype.
mutate(Date = as.Date(Date)) %>%
# Link to full range of dates, with blank rows for missing dates.
right_join(
# A temporary dataset with the full range of 'Date's.
tibble(Date = seq(from = min(.$Date), to = max(.$Date), by = "days")),
by = "Date"
) %>%
# Sort for filling: earlier above later.
arrange(Date) %>%
# Fill blank rows with values above.
fill(everything(), .direction = "down")
Results
Given my_data
like the data.frame
reproduced here
my_data <- structure(
list(
Date = c(
"2020-10-28", "2020-10-29", "2020-10-30", "2020-11-02", "2020-11-03",
"2020-11-04", "2020-11-05", "2020-11-06", "2020-11-09", "2020-11-10"
),
Open = c(
1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
),
High = c(
1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
),
Low = c(
1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
),
Close = c(
1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
),
Adjusted_close = c(
1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
),
Volume = c(
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
)
),
row.names = c(NA, 10L),
class = "data.frame"
)
this solution should yield a data.frame
like this:
Date Open High Low Close Adjusted_close Volume
1 2020-10-28 1384 1384 1384 1384 1384 0
2 2020-10-29 1297 1297 1297 1297 1297 0
3 2020-10-30 1283 1283 1283 1283 1283 0
4 2020-10-31 1283 1283 1283 1283 1283 0
5 2020-11-01 1283 1283 1283 1283 1283 0
6 2020-11-02 1284 1284 1284 1284 1284 0
7 2020-11-03 1263 1263 1263 1263 1263 0
8 2020-11-04 1224 1224 1224 1224 1224 0
9 2020-11-05 1194 1194 1194 1194 1194 0
10 2020-11-06 1196 1196 1196 1196 1196 0
11 2020-11-07 1196 1196 1196 1196 1196 0
12 2020-11-08 1196 1196 1196 1196 1196 0
13 2020-11-09 1207 1207 1207 1207 1207 0
14 2020-11-10 1200 1200 1200 1200 1200 0
CodePudding user response:
1) Convert to a zoo
class series z
using read.zoo
(this also converts Date
to Date
class) and then merge a zero width zoo object with all dates with z
. Fill in the missing values using na.locf
and finally convert back to data frame using fortify.zoo
. If a zoo object is ok as the result then omit the fortify.zoo
part.
library(zoo)
z <- read.zoo(dat)
out1 <- merge(z, zoo(, seq(start(z), end(z), "day"))) |>
na.locf() |>
fortify.zoo(name = "Date")
# check - target is defined in Note at the end
identical(out1, transform(target, Date = as.Date(Date)))
## [1] TRUE
2) In this alternative we use the following pipeline. Instead of using merge.zoo
, as above, this convert to ts class and back to expand the dates.
- Convert
dat
tozoo
class which also converts the index toDate
class. - Then convert that to
ts
class. Since that class only supports regularly spaced series the conversion will fill in the values corresponding to missing dates with NA's. na.locf
will then fill in those NA's.- use
fortify.zoo
to convert that back to a data frame. - Since
ts
class does not support Date indexes the Date column at this point is just numbers so convert those back toDate
class.
library(zoo)
out2 <- dat |>
read.zoo() |>
as.ts() |>
na.locf() |>
fortify.zoo(name = "Date") |>
transform(Date = as.Date(Date))
# check - target is defined in Note at the end
identical(out2, transform(target, Date = as.Date(Date)))
## [1] TRUE
Note
The input dat
and output target
in reproducible form are assumed to be:
Lines <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
dat <- read.csv(text = Lines, strip.white = TRUE)
Lines2 <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
target <- read.csv(text = Lines2, strip.white = TRUE)