I'm currently scraping data from a spreadsheet with the following format:
Each financial year is separated by in the following way:
What I'd like to do is create an additional column to the left called 'Financial_Year' that takes the date from the relevant cell.
So I would like the df to look as follows:
My code thus far is like:
library(tidyverse)
library(dplyr)
library(XLConnect)
tmp = tempfile(fileext = ".xls")
download.file(url = "https://dmo.gov.uk/umbraco/surface/DataExport/GetDataExport?reportCode=D4L&exportFormatValue=xls¶meters=&Financial Year=(All)", destfile = tmp, mode="wb")
holds <- readWorksheetFromFile(file = tmp, sheet=1) %>%
filter(across(everything(), ~!is.na(.)))
Does anyone have any advice on how to achieve the result? (file is quite large).
EDIT What I have been doing up till now, is downloading each year-long s/s individually, knitting them together, renaming columns and dropping various columns.
CodePudding user response:
First, we load the first column so that we can group the tables and load them iteratively. We assume that first-column with "Financial"
is a clear label; from that, we skip the first few rows of each (to limit to just the monthly data within a year), and load into a list
:
# library(readxl)
# library(cellranger) # imported by readxl
col1 <- readxl::read_xls(tmp, range = cellranger::cell_cols(1))
tablenames <- sub(".* - ", "", grep("Financial", col1[[1]], value = TRUE))
tablenames
# [1] "2005-06" "2006-07" "2007-08" "2008-09" "2009-10" "2010-11" "2011-12" "2012-13" "2013-14" "2014-15" "2015-16" "2016-17"
# [13] "2017-18" "2018-19" "2019-20" "2020-21" "2021-22"
tablerows <- split(seq_along(col1[[1]]), cumsum(grepl("Financial", col1[[1]])))[-1]
alldat <- lapply(setNames(tablerows, nm = tablenames), function(R) {
readxl::read_xls("~/Downloads/quux.xls", range = cellranger::cell_rows(R[-(1:5)]), .name_repair = "universal")
})
# New names:
### ...snip...
alldat[["2005-06"]]
# # A tibble: 56 x 14
# ...1 ...2 Apr..2005 May..2005 Jun..2005 Jul..2005 Aug..2005 Sep..2005 Oct..2005 Nov..2005 Dec..2005 Jan..2006 Feb..2006 Mar..2006
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10½% Exchequer Stock 2005 GB0003270005 0.00395 0.00395 0.00842 0.00842 0.00842 0 0 0 0 0 0 0
# 2 8½% Treasury Stock 2005 GB0008880808 309. 309. 301. 301. 301. 301. 301. 545. 0 0 0 0
# 3 7¾% Treasury Stock 2006 GB0008916024 549. 549. 549. 549. 549. 549. 549. 549. 549. 548. 548. 548.
# 4 9¾% Conversion Stock 2006 GB0009021956 0 0 0.002 0.00505 0.00505 0.00505 0.00505 0.00505 0.00505 0.00505 0.00505 0.00505
# 5 7½% Treasury Stock 2006 GB0009998302 560. 602. 602. 602. 601. 601. 862. 862. 862. 862. 862. 862.
# 6 4½% Treasury Stock 2007 GB0034040740 344. 344. 344. 344. 342. 342. 596. 596. 596. 596. 596. 596.
# 7 8½% Treasury Loan 2007 GB0009126557 498. 498. 498. 498. 498. 498. 601. 601. 601. 601. 601. 601.
# 8 7¼% Treasury Stock 2007 GB0009997114 550. 550. 550. 550. 549. 549. 795. 795. 795. 795. 795. 795.
# 9 5% Treasury Stock 2008 GB0031734154 557. 558. 558. 558. 558. 558. 873. 873. 873. 872. 872. 872.
# 10 9% Treasury Loan 2008 GB0009128371 1.59 2.12 11.8 12.0 12.4 12.8 13.0 13.0 13.1 13.9 13.9 0.0442
# # ... with 46 more rows
alldat[["2006-07"]]
# # A tibble: 59 x 14
# ...1 ...2 Apr..2006 May..2006 Jun..2006 Jul..2006 Aug..2006 Sep..2006 Oct..2006 Nov..2006 Dec..2006 Jan..2007 Feb..2007 Mar..2007
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 7¾% Treasury Stock 2006 GB0008916024 549. 549. 629. 629. 740. 0 0 0 0 0 0 0
# 2 9¾% Conversion Stock 2006 GB0009021956 0.00505 0.00505 0.00505 0.00505 0.00505 0.00705 0.00705 0 0 0 0 0
# 3 7½% Treasury Stock 2006 GB0009998302 862. 862. 1175. 1175. 1175. 1388. 1664. 5159. 0 0 0 0
# 4 4½% Treasury Stock 2007 GB0034040740 596. 596. 596. 596. 596. 596. 588. 668. 668. 668. 1597. 0
# 5 8½% Treasury Loan 2007 GB0009126557 601. 601. 601. 601. 601. 601. 602. 602. 602. 602. 600. 600.
# 6 7¼% Treasury Stock 2007 GB0009997114 795. 795. 795. 795. 795. 795. 795. 795. 795. 795. 795. 795.
# 7 5% Treasury Stock 2008 GB0031734154 873. 873. 873. 873. 873. 872. 864. 864. 864. 864. 865. 865.
# 8 9% Treasury Loan 2008 GB0009128371 0.485 0.596 1.63 1.65 4.72 82.0 82.2 82.3 82.3 103. 104. 0.261
# 9 4% Treasury Stock 2009 GB0032785924 754. 754. 754. 754. 754. 754. 746. 745. 745. 745. 746. 746.
# 10 8% Treasury Stock 2009 GB0009125369 2.12 4.24 4.31 4.48 4.64 4.75 4.83 5.88 5.98 6.76 6.88 0.189
# # ... with 49 more rows
Note: I used tablenames
in case you prefer to work with this as a list of frames in its wide format, so referencing alldat[["2005-06"]]
is meaningful. However, you can combine this into one frame by pivoting first (because column names are all different). Try this as a starting point:
library(dplyr)
library(tidyr) # pivot_longer
combined <- lapply(
alldat,
function(z) separate(pivot_longer(z, -(1:2)), name, c("month", "year"))
) %>%
bind_rows()
combined
# # A tibble: 14,644 x 5
# ...1 ...2 month year value
# <chr> <chr> <chr> <chr> <dbl>
# 1 10½% Exchequer Stock 2005 GB0003270005 Apr 2005 0.00395
# 2 10½% Exchequer Stock 2005 GB0003270005 May 2005 0.00395
# 3 10½% Exchequer Stock 2005 GB0003270005 Jun 2005 0.00842
# 4 10½% Exchequer Stock 2005 GB0003270005 Jul 2005 0.00842
# 5 10½% Exchequer Stock 2005 GB0003270005 Aug 2005 0.00842
# 6 10½% Exchequer Stock 2005 GB0003270005 Sep 2005 0
# 7 10½% Exchequer Stock 2005 GB0003270005 Oct 2005 0
# 8 10½% Exchequer Stock 2005 GB0003270005 Nov 2005 0
# 9 10½% Exchequer Stock 2005 GB0003270005 Dec 2005 0
# 10 10½% Exchequer Stock 2005 GB0003270005 Jan 2006 0
# # ... with 14,634 more rows
combined[10000:10010,]
# # A tibble: 11 x 5
# ...1 ...2 month year value
# <chr> <chr> <chr> <chr> <dbl>
# 1 3¾% Treasury Gilt 2020 GB00B582JV65 Jul 2017 1423.
# 2 3¾% Treasury Gilt 2020 GB00B582JV65 Aug 2017 1423.
# 3 3¾% Treasury Gilt 2020 GB00B582JV65 Sep 2017 1423.
# 4 3¾% Treasury Gilt 2020 GB00B582JV65 Oct 2017 1423.
# 5 3¾% Treasury Gilt 2020 GB00B582JV65 Nov 2017 1423.
# 6 3¾% Treasury Gilt 2020 GB00B582JV65 Dec 2017 1423.
# 7 3¾% Treasury Gilt 2020 GB00B582JV65 Jan 2018 1423.
# 8 3¾% Treasury Gilt 2020 GB00B582JV65 Feb 2018 1423.
# 9 3¾% Treasury Gilt 2020 GB00B582JV65 Mar 2018 1423.
# 10 1½% Treasury Gilt 2021 GB00BYY5F581 Apr 2017 349.
# 11 1½% Treasury Gilt 2021 GB00BYY5F581 May 2017 349.
You'll likely want to convert month
and year
into either a Date
(mutate(date = as.Date(paste(year, month, "01"), format = "%Y %b %d"))
) or at least into something sortable.