I have a dataset that looks like this:

Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0

and I wanted to fill in values for say 10-31 and 11-1 to contain the values from the previous trading day (10-30). How is this easily accomplished in R? I feel as though library(tidyr) and complete fit into this picture somehow?

Expected representation would be:

Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0

Requested dput output

structure(list(Date = c("2020-10-28", "2020-10-29", "2020-10-30", 
"2020-11-02", "2020-11-03", "2020-11-04", "2020-11-05", "2020-11-06", 
"2020-11-09", "2020-11-10"), Open = c(1384L, 1297L, 1283L, 1284L, 
1263L, 1224L, 1194L, 1196L, 1207L, 1200L), High = c(1384L, 1297L, 
1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Low = c(1384L, 
1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
), Close = c(1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 
1196L, 1207L, 1200L), Adjusted_close = c(1384L, 1297L, 1283L, 
1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L), Volume = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 10L), class = "data.frame")

CodePudding user response：

first Date must be in date format

df$Date = as.Date(df$Date)

df %>% 
  full_join(data.frame(Date = seq(min(df$Date), max(df$Date), by = "days")),by = "Date") %>% 
  arrange(Date) %>% 
  fill(everything())

then make a join with a data that only contains the entire sequence of dates from the database, we sort it and use the fill function to fill them

CodePudding user response：

Solution

Here's a solution in the tidyverse, which leverages the tidyr::fill() function to fill in values from earlier rows:

library(tidyverse)


# ...
# Code to generate 'my_data'.
# ...


my_data %>%
  # Ensure 'Date' column is proper datatype.
  mutate(Date = as.Date(Date)) %>%
  # Link to full range of dates, with blank rows for missing dates.
  right_join(
    # A temporary dataset with the full range of 'Date's.
    tibble(Date = seq(from = min(.$Date), to = max(.$Date), by = "days")),
    by = "Date"
  ) %>%
  # Sort for filling: earlier above later.
  arrange(Date) %>%
  # Fill blank rows with values above.
  fill(everything(), .direction = "down")

Results

Given my_data like the data.frame reproduced here

my_data <- structure(
  list(
    Date = c(
      "2020-10-28", "2020-10-29", "2020-10-30", "2020-11-02", "2020-11-03",
      "2020-11-04", "2020-11-05", "2020-11-06", "2020-11-09", "2020-11-10"
    ),
    Open = c(
      1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
    ),
    High = c(
      1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
    ),
    Low = c(
      1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
    ),
    Close = c(
      1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
    ),
    Adjusted_close = c(
      1384L, 1297L, 1283L, 1284L, 1263L, 1224L, 1194L, 1196L, 1207L, 1200L
    ),
    Volume = c(
      0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    )
  ),
  row.names = c(NA, 10L),
  class = "data.frame"
)

this solution should yield a data.frame like this:

         Date Open High  Low Close Adjusted_close Volume
1  2020-10-28 1384 1384 1384  1384           1384      0
2  2020-10-29 1297 1297 1297  1297           1297      0
3  2020-10-30 1283 1283 1283  1283           1283      0
4  2020-10-31 1283 1283 1283  1283           1283      0
5  2020-11-01 1283 1283 1283  1283           1283      0
6  2020-11-02 1284 1284 1284  1284           1284      0
7  2020-11-03 1263 1263 1263  1263           1263      0
8  2020-11-04 1224 1224 1224  1224           1224      0
9  2020-11-05 1194 1194 1194  1194           1194      0
10 2020-11-06 1196 1196 1196  1196           1196      0
11 2020-11-07 1196 1196 1196  1196           1196      0
12 2020-11-08 1196 1196 1196  1196           1196      0
13 2020-11-09 1207 1207 1207  1207           1207      0
14 2020-11-10 1200 1200 1200  1200           1200      0

CodePudding user response：

1) Convert to a zoo class series z using read.zoo (this also converts Date to Date class) and then merge a zero width zoo object with all dates with z. Fill in the missing values using na.locf and finally convert back to data frame using fortify.zoo. If a zoo object is ok as the result then omit the fortify.zoo part.

library(zoo)

z <- read.zoo(dat)
out1 <- merge(z, zoo(, seq(start(z), end(z), "day"))) |> 
  na.locf() |>
  fortify.zoo(name = "Date")

# check - target is defined in Note at the end
identical(out1, transform(target, Date = as.Date(Date)))
## [1] TRUE

2) In this alternative we use the following pipeline. Instead of using merge.zoo, as above, this convert to ts class and back to expand the dates.

Convert dat to zoo class which also converts the index to Date class.
Then convert that to ts class. Since that class only supports regularly spaced series the conversion will fill in the values corresponding to missing dates with NA's.
na.locf will then fill in those NA's.
use fortify.zoo to convert that back to a data frame.
Since ts class does not support Date indexes the Date column at this point is just numbers so convert those back to Date class.

library(zoo)

out2 <- dat |> 
  read.zoo() |>
  as.ts() |>
  na.locf() |>
  fortify.zoo(name = "Date") |>
  transform(Date = as.Date(Date))

# check - target is defined in Note at the end    
identical(out2, transform(target, Date = as.Date(Date)))
## [1] TRUE

Note

The input dat and output target in reproducible form are assumed to be:

Lines <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
dat <- read.csv(text = Lines, strip.white = TRUE)

Lines2 <- "Date,Open,High,Low,Close,Adjusted_close,Volume
2020-10-28,1384,1384,1384,1384,1384,0
2020-10-29,1297,1297,1297,1297,1297,0
2020-10-30,1283,1283,1283,1283,1283,0
2020-10-31,1283,1283,1283,1283,1283,0
2020-11-01,1283,1283,1283,1283,1283,0
2020-11-02,1284,1284,1284,1284,1284,0
2020-11-03,1263,1263,1263,1263,1263,0
2020-11-04,1224,1224,1224,1224,1224,0
2020-11-05,1194,1194,1194,1194,1194,0
2020-11-06,1196,1196,1196,1196,1196,0
2020-11-07,1196,1196,1196,1196,1196,0
2020-11-08,1196,1196,1196,1196,1196,0
2020-11-09,1207,1207,1207,1207,1207,0
2020-11-10,1200,1200,1200,1200,1200,0"
target <- read.csv(text = Lines2, strip.white = TRUE)