manipulate data from nested df and lists-CodePudding

I want to study some stocks or financial indices. I ise the package yfR and the yf_get function to download data from Yahoo Finance. This function returns a df with numerous variables. I want to select some of them and them create a df only with the wanted variables. Here is my problem:

library(yfR)
library(tidyverse)

Symbols <- c("^GSPC", "^FTSE")

StartDate <- "2010-01-01"
EndDate <- "2019-12-31"

RawData <- yf_get (Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily" ,do_complete_data = TRUE)

# Here is the initial structure of the RawDat df

str(RawData)

tibble [5,039 × 11] (S3: tbl_df/tbl/data.frame)
 $ ticker                : chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
 $ ref_date              : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ price_open            : num [1:5039] 5413 5500 5522 5530 5527 ...
 $ price_high            : num [1:5039] 5500 5536 5536 5552 5549 ...
 $ price_low             : num [1:5039] 5411 5481 5498 5500 5495 ...
 $ price_close           : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ volume                : num [1:5039] 7.51e 08 1.15e 09 9.98e 08 1.16e 09 1.01e 09 ...
 $ price_adjusted        : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ ret_adjusted_prices   : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
 $ ret_closing_prices    : num [1:5039] NA 0.004036 0.001358 -0.000597 0.001357 ...
 $ cumret_adjusted_prices: num [1:5039] 1 1 1.01 1 1.01 ...
 - attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ ticker              : chr [1:2] "^FTSE" "^GSPC"
  ..$ dl_status           : chr [1:2] "OK" "OK"
  ..$ n_rows              : int [1:2] 2524 2515
  ..$ perc_benchmark_dates: num [1:2] 0.982 1
  ..$ threshold_decision  : chr [1:2] "KEEP" "KEEP"

Note that the indices or stocks we want may have different length (different number of obs) because of some national holidays and so on. So right now we have 2515 obs of GSPC and 2524 of FTSE. Lets say I am interested in keeping the columns ref_date, price_adjusted and ticker(in order to use as filtering mechanism somehow later). I tried to pipe until a certain point and it goes like this:

Returns <- RawData %>% 
    select(ref_date, price_adjusted, ticker) %>% 
    rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)


# And we end up with this 

str(Returns)

tibble [5,039 × 3] (S3: tbl_df/tbl/data.frame)
 $ Date  : Date[1:5039], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Price : num [1:5039] 5500 5522 5530 5527 5534 ...
 $ Ticker: chr [1:5039] "^FTSE" "^FTSE" "^FTSE" "^FTSE" ...
 - attr(*, "df_control")= tibble [2 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ ticker              : chr [1:2] "^FTSE" "^GSPC"
  ..$ dl_status           : chr [1:2] "OK" "OK"
  ..$ n_rows              : int [1:2] 2524 2515
  ..$ perc_benchmark_dates: num [1:2] 0.982 1
  ..$ threshold_decision  : chr [1:2] "KEEP" "KEEP"

Here comes my problem. I want the end product to be a df with 4 columns (Date_Stock1, Price_Stock1, Date_Stock2, Price_Stock2). If i had 3 stocks and 3 variables the end product would have been a df with 9 columns (Date_Stock1, Price_Stock1, Volume_Stock1, Date_Stock2, Price_Stock2, Volume_Stock1, Date_Stock3, Price_Stock3, Volume_Stock3)

I tried to use filter and subset fron tidyr but i failed. My best attemp was to use pivot_wider with the result been a df with 4 columns with 1 row and inside i got lists with the values and i dont know how to revert them back to a df.

Returns <- RawData %>%
  select(ref_date, price_adjusted, ticker) %>% 
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker) %>% 
  pivot_wider(names_from = "Ticker", values_from = c(Date, Price))

# Also received this warning
Warning message:
Values from `Date` and `Price` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(Ticker) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 

str(Returns)

tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
 $ Date_^FTSE :List of 1
  ..$ : Date[1:2524], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Date_^GSPC :List of 1
  ..$ : Date[1:2515], format: "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" ...
 $ Price_^FTSE:List of 1
  ..$ : num [1:2524] 5500 5522 5530 5527 5534 ...
    $ Price_^GSPC:List of 1
     ..$ : num [1:2515] 1133 1137 1137 1142 1145 ...

How can i reach my end goal? Some kind of mutate or for loop or i dont know map perhaps. I dont know how to handle those functions. I only saw tutorials but i vant make them do the trick. Any ideas?

CodePudding user response：

To convert lists with values back to a dataframe:

bro.df<-do.call(rbind.data.frame, data_list)

CodePudding user response：

The challenge you are facing involves restructuring your dataset from a long format to a wider format while dealing with different lengths for each stock/index due to differing number of observations. This is a common issue when dealing with time series data of this nature. The function pivot_wider from the tidyverse is indeed a good choice for this task, however, due to the differing number of observations, a direct application of pivot_wider results in a list within each cell as you have observed.

A good way to approach this problem would be to separate the data for each stock/index into individual data frames, ensure they have the same number of observations (filling missing dates with NA where necessary), and then bind them together column-wise. Here's how you could do it:

library(yfR)
library(tidyverse)

# Define the symbols and date range
Symbols <- c("^GSPC", "^FTSE")
StartDate <- "2010-01-01"
EndDate <- "2019-12-31"

# Get the raw data from Yahoo Finance
RawData <- yf_get(Symbols, first_date = StartDate, last_date = EndDate, freq_data = "daily", do_complete_data = TRUE)

# Select and rename the columns of interest
Returns <- RawData %>% 
  select(ref_date, price_adjusted, ticker) %>% 
  rename(Date = ref_date, Price = price_adjusted, Ticker = ticker)

# Split the data into separate data frames for each stock/index
list_of_dfs <- split(Returns, Returns$Ticker)

# Ensure each data frame has the same number of observations by filling missing dates with NA
# First, create a sequence of dates that covers the entire range for both stocks
all_dates <- seq.Date(min(sapply(list_of_dfs, function(df) min(df$Date))),
                      max(sapply(list_of_dfs, function(df) max(df$Date))),
                      by = "day")

# Now, for each stock/index data frame, ensure it has a row for each date in all_dates
list_of_dfs <- lapply(list_of_dfs, function(df) {
  df <- df %>%
    full_join(data.frame(Date = all_dates), by = "Date") %>%
    arrange(Date)
  return(df)
})

# Now bind the separate data frames together column-wise
result <- NULL
for (i in seq_along(list_of_dfs)) {
  df <- list_of_dfs[[i]]
  # Create column names based on the stock/index ticker
  colnames(df)[2] <- paste0("Price_", df$Ticker[1])
  colnames(df)[1] <- paste0("Date_", df$Ticker[1])
  df$Ticker <- NULL  # remove the Ticker column as it's no longer needed
  if (is.null(result)) {
    result <- df
  } else {
    result <- bind_cols(result, df)
  }
}

# View the resulting data frame
str(result)

In this code snippet:

We first split Returns into a list of data frames, one for each stock/index, using the split function.
We then create a sequence of dates all_dates that covers the entire date range for both stocks/indices.
We use lapply to go through each data frame in list_of_dfs, and for each data frame we use full_join to ensure it has a row for each date in all_dates, filling in NA for missing dates.
We then go through each data frame in list_of_dfs with a for loop, and use bind_cols to bind them together column-wise into the final result result.
We also adjust the column names within the loop to reflect the stock/index ticker as per your request.

This will give you a data frame result with separate columns for the date and price of each stock/index, and with the same number of rows for each stock/index, with NA filled in for missing dates.