Home > other >  Large Dataframe from a url has a very strange format. Does anyone have any advice on how to reformat
Large Dataframe from a url has a very strange format. Does anyone have any advice on how to reformat

Time:04-02

I'm currently scraping data from a spreadsheet with the following format: enter image description here

Each financial year is separated by in the following way: enter image description here

What I'd like to do is create an additional column to the left called 'Financial_Year' that takes the date from the relevant cell.

So I would like the df to look as follows: enter image description here

My code thus far is like:

library(tidyverse)
library(dplyr)
library(XLConnect)
tmp = tempfile(fileext = ".xls")
download.file(url = "https://dmo.gov.uk/umbraco/surface/DataExport/GetDataExport?reportCode=D4L&exportFormatValue=xls&parameters=&Financial Year=(All)", destfile = tmp, mode="wb")
holds <- readWorksheetFromFile(file = tmp, sheet=1) %>% 
  filter(across(everything(), ~!is.na(.)))

Does anyone have any advice on how to achieve the result? (file is quite large).

EDIT What I have been doing up till now, is downloading each year-long s/s individually, knitting them together, renaming columns and dropping various columns.

CodePudding user response:

First, we load the first column so that we can group the tables and load them iteratively. We assume that first-column with "Financial" is a clear label; from that, we skip the first few rows of each (to limit to just the monthly data within a year), and load into a list:

# library(readxl)
# library(cellranger) # imported by readxl
col1 <- readxl::read_xls(tmp, range = cellranger::cell_cols(1))
tablenames <- sub(".* - ", "", grep("Financial", col1[[1]], value = TRUE))
tablenames
#  [1] "2005-06" "2006-07" "2007-08" "2008-09" "2009-10" "2010-11" "2011-12" "2012-13" "2013-14" "2014-15" "2015-16" "2016-17"
# [13] "2017-18" "2018-19" "2019-20" "2020-21" "2021-22"
tablerows <- split(seq_along(col1[[1]]), cumsum(grepl("Financial", col1[[1]])))[-1]

alldat <- lapply(setNames(tablerows, nm = tablenames), function(R) {
  readxl::read_xls("~/Downloads/quux.xls", range = cellranger::cell_rows(R[-(1:5)]), .name_repair = "universal")
})
# New names:
### ...snip...

alldat[["2005-06"]]
# # A tibble: 56 x 14
#    ...1                      ...2         Apr..2005 May..2005 Jun..2005 Jul..2005 Aug..2005 Sep..2005 Oct..2005 Nov..2005 Dec..2005 Jan..2006 Feb..2006 Mar..2006
#    <chr>                     <chr>            <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#  1 10½% Exchequer Stock 2005 GB0003270005   0.00395   0.00395   0.00842   0.00842   0.00842   0         0         0         0         0         0         0      
#  2 8½% Treasury Stock 2005   GB0008880808 309.      309.      301.      301.      301.      301.      301.      545.        0         0         0         0      
#  3 7¾% Treasury Stock 2006   GB0008916024 549.      549.      549.      549.      549.      549.      549.      549.      549.      548.      548.      548.     
#  4 9¾% Conversion Stock 2006 GB0009021956   0         0         0.002     0.00505   0.00505   0.00505   0.00505   0.00505   0.00505   0.00505   0.00505   0.00505
#  5 7½% Treasury Stock 2006   GB0009998302 560.      602.      602.      602.      601.      601.      862.      862.      862.      862.      862.      862.     
#  6 4½% Treasury Stock 2007   GB0034040740 344.      344.      344.      344.      342.      342.      596.      596.      596.      596.      596.      596.     
#  7 8½% Treasury Loan 2007    GB0009126557 498.      498.      498.      498.      498.      498.      601.      601.      601.      601.      601.      601.     
#  8 7¼% Treasury Stock 2007   GB0009997114 550.      550.      550.      550.      549.      549.      795.      795.      795.      795.      795.      795.     
#  9 5% Treasury Stock 2008    GB0031734154 557.      558.      558.      558.      558.      558.      873.      873.      873.      872.      872.      872.     
# 10 9% Treasury Loan 2008     GB0009128371   1.59      2.12     11.8      12.0      12.4      12.8      13.0      13.0      13.1      13.9      13.9       0.0442 
# # ... with 46 more rows

alldat[["2006-07"]]
# # A tibble: 59 x 14
#    ...1                      ...2         Apr..2006 May..2006  Jun..2006  Jul..2006  Aug..2006  Sep..2006  Oct..2006 Nov..2006 Dec..2006 Jan..2007 Feb..2007 Mar..2007
#    <chr>                     <chr>            <dbl>     <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#  1 7¾% Treasury Stock 2006   GB0008916024 549.      549.       629.       629.       740.         0          0            0         0         0         0        0    
#  2 9¾% Conversion Stock 2006 GB0009021956   0.00505   0.00505    0.00505    0.00505    0.00505    0.00705    0.00705      0         0         0         0        0    
#  3 7½% Treasury Stock 2006   GB0009998302 862.      862.      1175.      1175.      1175.      1388.      1664.        5159.        0         0         0        0    
#  4 4½% Treasury Stock 2007   GB0034040740 596.      596.       596.       596.       596.       596.       588.         668.      668.      668.     1597.       0    
#  5 8½% Treasury Loan 2007    GB0009126557 601.      601.       601.       601.       601.       601.       602.         602.      602.      602.      600.     600.   
#  6 7¼% Treasury Stock 2007   GB0009997114 795.      795.       795.       795.       795.       795.       795.         795.      795.      795.      795.     795.   
#  7 5% Treasury Stock 2008    GB0031734154 873.      873.       873.       873.       873.       872.       864.         864.      864.      864.      865.     865.   
#  8 9% Treasury Loan 2008     GB0009128371   0.485     0.596      1.63       1.65       4.72      82.0       82.2         82.3      82.3     103.      104.       0.261
#  9 4% Treasury Stock 2009    GB0032785924 754.      754.       754.       754.       754.       754.       746.         745.      745.      745.      746.     746.   
# 10 8% Treasury Stock 2009    GB0009125369   2.12      4.24       4.31       4.48       4.64       4.75       4.83         5.88      5.98      6.76      6.88     0.189
# # ... with 49 more rows

Note: I used tablenames in case you prefer to work with this as a list of frames in its wide format, so referencing alldat[["2005-06"]] is meaningful. However, you can combine this into one frame by pivoting first (because column names are all different). Try this as a starting point:

library(dplyr)
library(tidyr) # pivot_longer
combined <- lapply(
  alldat,
  function(z) separate(pivot_longer(z, -(1:2)), name, c("month", "year"))
  ) %>%
  bind_rows()
combined
# # A tibble: 14,644 x 5
#    ...1                      ...2         month year    value
#    <chr>                     <chr>        <chr> <chr>   <dbl>
#  1 10½% Exchequer Stock 2005 GB0003270005 Apr   2005  0.00395
#  2 10½% Exchequer Stock 2005 GB0003270005 May   2005  0.00395
#  3 10½% Exchequer Stock 2005 GB0003270005 Jun   2005  0.00842
#  4 10½% Exchequer Stock 2005 GB0003270005 Jul   2005  0.00842
#  5 10½% Exchequer Stock 2005 GB0003270005 Aug   2005  0.00842
#  6 10½% Exchequer Stock 2005 GB0003270005 Sep   2005  0      
#  7 10½% Exchequer Stock 2005 GB0003270005 Oct   2005  0      
#  8 10½% Exchequer Stock 2005 GB0003270005 Nov   2005  0      
#  9 10½% Exchequer Stock 2005 GB0003270005 Dec   2005  0      
# 10 10½% Exchequer Stock 2005 GB0003270005 Jan   2006  0      
# # ... with 14,634 more rows
combined[10000:10010,]
# # A tibble: 11 x 5
#    ...1                   ...2         month year  value
#    <chr>                  <chr>        <chr> <chr> <dbl>
#  1 3¾% Treasury Gilt 2020 GB00B582JV65 Jul   2017  1423.
#  2 3¾% Treasury Gilt 2020 GB00B582JV65 Aug   2017  1423.
#  3 3¾% Treasury Gilt 2020 GB00B582JV65 Sep   2017  1423.
#  4 3¾% Treasury Gilt 2020 GB00B582JV65 Oct   2017  1423.
#  5 3¾% Treasury Gilt 2020 GB00B582JV65 Nov   2017  1423.
#  6 3¾% Treasury Gilt 2020 GB00B582JV65 Dec   2017  1423.
#  7 3¾% Treasury Gilt 2020 GB00B582JV65 Jan   2018  1423.
#  8 3¾% Treasury Gilt 2020 GB00B582JV65 Feb   2018  1423.
#  9 3¾% Treasury Gilt 2020 GB00B582JV65 Mar   2018  1423.
# 10 1½% Treasury Gilt 2021 GB00BYY5F581 Apr   2017   349.
# 11 1½% Treasury Gilt 2021 GB00BYY5F581 May   2017   349.

You'll likely want to convert month and year into either a Date (mutate(date = as.Date(paste(year, month, "01"), format = "%Y %b %d"))) or at least into something sortable.

  • Related