SO newbie here, so please bear with me and some daft questions!
Can anyone please tell me the most efficient way to read ~50 Excel files with multiple tabs into R, select a specific tab from each of the 50, and combine those tabs by date into one large dataframe?
For example, let's assume I want to see what GP activity has been like in a specific healthcare region here in the UK for the last 4 years. I currently have to download a great many monthly reports like this one: Appointments in General Practice, March 2022.
I then have to go to Table 3a, scroll down to the NHS area code "07k", and paste the activity from that row into a separate Excel sheet.
I have to do this a lot and it takes hours, but I'm not code-literate enough to figure out a better way of doing it.
From the guidance of others, my attempts to date have involved downloading many months' data to a desktop folder. I set my working directory to that folder and then attempt the following:
# list every .xls/.xlsx file name in the records folder
files <- list.files("C:/Users/myname/Desktop/recordsfolder", pattern = "\\.xlsx?$")
for (i in files) {
  # build the full path and read each workbook into an object named after its file
  filepath <- file.path("C:/Users/myname/Desktop/recordsfolder", i)
  assign(i, readxl::read_excel(filepath))
}
But the honest truth is I don't really understand this, nor what the next step would be.
If anyone could point me in a better direction I'd be very grateful.
CodePudding user response:
Okay, so here is a function readData that reads and filters Table 3a from each of the files returned by list.files. You then apply this function to all of the files using map and reduce the resulting list of datasets to a single dataset with bind_rows.
I don't know how standardised your reports are, so you may need to adjust the read function accordingly.
library(readxl)
library(tidyverse)

readData <- function(path) {
  # Table 3a has a two-row header, so first rebuild clean column names
  colnames <- read_excel(path, sheet = "Table 3a", skip = 10, n_max = 1) %>%
    gather(header1, header2) %>%
    # drop the "...n" suffixes that readxl appends to unnamed columns
    mutate(across(everything(), ~ gsub("\\.\\.\\.\\d+", "", .))) %>%
    mutate(colname = ifelse(header1 == "", header2, paste0(header1, "_", header2))) %>%
    pull(colname)

  # read the data rows under those names and keep only the 07k row
  read_excel(path, sheet = "Table 3a", skip = 13, col_names = colnames) %>%
    filter(grepl("07k", `NHS Area Code`, ignore.case = TRUE))
}
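It's worth sanity-checking the function on a single file before mapping it over all of them; the file name below is just a placeholder for one of your downloaded reports:
# quick check that the headers and the 07k row come out as expected
readData("C:/Users/myname/Desktop/recordsfolder/Appointments_Mar_22.xlsx")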
files <- list.files("C:/Users/myname/Desktop/recordsfolder", full.name = TRUE)
files %>%
map(readData) %>%
bind_rows()
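One hedged addition: with ~50 workbooks, a single malformed file (or one missing Table 3a) will abort the whole map() call. Wrapping readData in purrr's possibly(), which the tidyverse loads, lets the rest carry on:
# return NULL for any file that errors rather than stopping the run
safe_read <- possibly(readData, otherwise = NULL)

combined <- files %>%
  map(safe_read) %>%
  bind_rows()   # bind_rows() silently skips the NULL entries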
CodePudding user response:
One approach to improve your workflow is to move away from Excel files whenever possible, especially if CSV files are already available.
For example, instead download the zip file https://files.digital.nhs.uk/EA/072466/Appointments_GP_Daily_Mar_22.zip and then read in the March 2022 CSV using data.table::fread():
library(data.table)

# stream the CSV straight out of the zip (needs the unzip utility on your PATH)
mar22 <- fread(cmd = 'unzip -p Appointments_GP_Daily_Mar_22.zip CCG_CSV_Mar_22.csv')
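If the unzip binary isn't on your PATH (common on Windows), a minimal base-R alternative, reusing the same file names as above, is to download and extract first:
# download the archive, pull out the CSV with base R, then fread it
download.file(
  "https://files.digital.nhs.uk/EA/072466/Appointments_GP_Daily_Mar_22.zip",
  destfile = "Appointments_GP_Daily_Mar_22.zip", mode = "wb"
)
unzip("Appointments_GP_Daily_Mar_22.zip", files = "CCG_CSV_Mar_22.csv")
mar22 <- fread("CCG_CSV_Mar_22.csv")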
You can then construct information for Table 3a. For example, appointment counts by status and CCG:
dcast(
  mar22[, sum(COUNT_OF_APPOINTMENTS), .(CCG_NAME, APPT_STATUS, CCG_ONS_CODE)],
  CCG_ONS_CODE + CCG_NAME ~ APPT_STATUS, value.var = "V1"
)
Output:
CCG_ONS_CODE CCG_NAME Attended DNA Unknown
1: E38000006 NHS Barnsley CCG 118066 5545 5595
2: E38000007 NHS Basildon and Brentwood CCG 121998 3572 4626
3: E38000008 NHS Bassetlaw CCG 61864 2293 1874
4: E38000014 NHS Blackburn with Darwen CCG 76233 4228 3243
5: E38000015 NHS Blackpool CCG 88336 5267 3362
---
102: E38000253 NHS Hampshire Southampton and Isle of Wight CCG 763293 36026 30755
103: E38000254 NHS Kirklees CCG 219937 7859 7637
104: E38000255 NHS North East London CCG 865775 57636 42749
105: E38000256 NHS North West London CCG 1127325 59500 45858
106: E38000257 NHS Shropshire and Telford and Wrekin CCG 236456 9836 9884
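To get back to just your own region, add a filter before reshaping. Note that this daily CSV identifies CCGs by ONS code and name rather than by the short NHS area code ("07k") used in Table 3a, so match on whichever identifier your file actually contains (check names(mar22)). NHS Barnsley CCG below is purely an illustration:
# keep one CCG before reshaping; swap in your own CCG's name or ONS code
dcast(
  mar22[grepl("Barnsley", CCG_NAME, ignore.case = TRUE),
        sum(COUNT_OF_APPOINTMENTS), .(CCG_NAME, APPT_STATUS, CCG_ONS_CODE)],
  CCG_ONS_CODE + CCG_NAME ~ APPT_STATUS, value.var = "V1"
)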