Home > Software engineering >  Mass import and shaping of data in R
Mass import and shaping of data in R

Time:09-18

SOF newbie here, so please bear with me and some daft questions!

Can anyone please tell me what the most efficient way would be to read in ~50 excel files with multiple tabs into R, select a specific tab from all 50, and combine those tabs by date to create a large dataframe?

For example, let's assume I want to see what GP activity has been like in a specific healthcare region here in the UK for the last 4 years. I currently have to download many many monthly reports like this one Appointments in General Practice, March 2022

I then have to go to table 3a, scroll down to the NHS area code "07k", and then paste the activity from that row into a separate excel sheet.

I have to do this a lot, and it takes hours, but am not coding literate enough to figure out a better way of doing this.

From the guidance of others, my attempts to date have involved downloading many months' data to a desktop folder. I set my working drive to that folder, and then attempt the following:

files <- list.files(pattern = "*.xls", full.names = TRUE)

for(i in files){
filepath <-    file.path("C:/Users/myname/Desktop/recordsfolder", i)
assign(i, readxl::read_xls(file path))
}

But the honest truth is I don't really understand this, nor what the next step would be.

If anyone could point me in a better direction I'd be very grateful.

CodePudding user response:

Okay so there is a function readData that reads and filters each of the table 3a's from the files in list.files. You then apply this function to all of the files using map and then bind_rows to reduce the list of datasets to a single dataset.

I dont know how standardised your reports are so you may need to adjust the read function accordingly

library(readxl)
library(tidyverse)

readData <- function(path){
  colnames <- read_excel(path, sheet = "Table 3a", skip = 10, n_max = 1) %>% 
    gather(header1, header2) %>% 
    mutate(across(everything(), ~gsub("\\.\\.\\.\\d ", "", .))) %>% 
    mutate(colname = ifelse(header1=="", header2, paste0(header1, "_", header2))) %>% 
    pull(colname)
  read_excel(path, sheet = "Table 3a", skip = 13, col_names = colnames) %>% 
    filter(grepl("07k", `NHS Area Code`, ignore.case = TRUE))
  }

files <- list.files("C:/Users/myname/Desktop/recordsfolder", full.name = TRUE)

files %>% 
  map(readData) %>% 
  bind_rows()

CodePudding user response:

One approach is improve your workflow is to move away from excel files, whenever possible, especially if CSV files are already available.

For example, instead download the zip file https://files.digital.nhs.uk/EA/072466/Appointments_GP_Daily_Mar_22.zip and then read in the Mar 2022 csv, using data.table::fread()

library(data.table)
mar22 <- fread(cmd = 'unzip -p Appointments_GP_Daily_Mar_22.zip CCG_CSV_Mar_22.csv')

You can then construct information for Table 3a. For example, appointment counts by status and CCG:

dcast(
    mar22[, sum(COUNT_OF_APPOINTMENTS), .(CCG_NAME,APPT_STATUS,CCG_ONS_CODE)],
    CCG_ONS_CODE CCG_NAME~APPT_STATUS, value.var="V1"
)

Output:

    CCG_ONS_CODE                                        CCG_NAME Attended   DNA Unknown
  1:    E38000006                                NHS Barnsley CCG   118066  5545    5595
  2:    E38000007                  NHS Basildon and Brentwood CCG   121998  3572    4626
  3:    E38000008                               NHS Bassetlaw CCG    61864  2293    1874
  4:    E38000014                   NHS Blackburn with Darwen CCG    76233  4228    3243
  5:    E38000015                               NHS Blackpool CCG    88336  5267    3362
 ---                                                                                    
102:    E38000253 NHS Hampshire Southampton and Isle of Wight CCG   763293 36026   30755
103:    E38000254                                NHS Kirklees CCG   219937  7859    7637
104:    E38000255                       NHS North East London CCG   865775 57636   42749
105:    E38000256                       NHS North West London CCG  1127325 59500   45858
106:    E38000257       NHS Shropshire and Telford and Wrekin CCG   236456  9836    9884
  • Related