Reading in xlsx files that are unstructured in R

Time:07-07

I am having trouble finding an efficient way to read multiple unstructured .xlsx files into R. This requires a bit of explaining, so that anyone trying to help can understand exactly what I am trying to do.

It was suggested that I use dput to make my dataset reproducible. The structure can be recreated with the code below:

x <- structure(
  list(
    ...1 = c(
      "Company Name", "Contact", "Name", "Phone #", "Scope of Work", NA,
      "Trees",
      "36\" Box Southern Live Oak (1.5\" Caliper)",
      "36\" Box Thornless Chilean Mesquite  (1.5\" Caliper)",
      NA, "DG", "Desert Gold", "Pink Coral"
    ),
    ...2 = c(
      "To:", "Date:", "Job Name:", "Plan Date:",
      "Install All Trees, Shrubs, Irrigation and Landscape Material to Meet all Landscape Plans and Specs",
      NA, NA, NA, NA, NA, NA, NA, NA
    ),
    ...3 = c(
      "Contractor", "DATE ID", "Job ID", "DATE ID",
      NA, NA, NA, NA, NA, NA, NA, NA, NA
    ),
    ...4 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    ...5 = c(
      NA, NA, NA, NA, NA, NA, "Quantity", "20", "38",
      NA, "Quantity", "26", "32"
    ),
    ...6 = c(NA, NA, NA, NA, NA, NA, NA, 10, 10, NA, NA, 10, 10),
    ...7 = c(NA, NA, NA, NA, NA, NA, NA, 200, 380, NA, NA, 260, 320)
  ),
  row.names = c(NA, -13L),
  class = c("tbl_df", "tbl", "data.frame")
)

The tibble will look like this, if you use the code above:

   ...1                             ...2  ...3  ...4  ...5   ...6  ...7
   <chr>                            <chr> <chr> <lgl> <chr> <dbl> <dbl>
 1 "Company Name"                   To:   Cont~ NA    NA       NA    NA
 2 "Contact"                        Date: DATE~ NA    NA       NA    NA
 3 "Name"                           Job ~ Job ~ NA    NA       NA    NA
 4 "Phone #"                        Plan~ DATE~ NA    NA       NA    NA
 5 "Scope of Work"                  Inst~ NA    NA    NA       NA    NA
 6  NA                              NA    NA    NA    NA       NA    NA
 7 "Trees"                          NA    NA    NA    Quan~    NA    NA
 8 "36\" Box Southern Live Oak (1.~ NA    NA    NA    20       10   200
 9 "36\" Box Thornless Chilean Mes~ NA    NA    NA    38       10   380
10  NA                              NA    NA    NA    NA       NA    NA
11 "DG"                             NA    NA    NA    Quan~    NA    NA
12 "Desert Gold"                    NA    NA    NA    26       10   260
13 "Pink Coral"                     NA    NA    NA    32       10   320

TLDR: These files are landscape bid forms sent to contractors. The subset x[1:5,1:3] contains information about the job, such as the job name, the date, the contractor's name, the landscape company's name, etc. Every one of the .xlsx files has exactly the same format for that subset. I would like to keep the job name but, for the purpose of this question, I will not make it the focus.

Below the x[1:5,1:3] subset, starting at x[7,1], there is a header named Trees, which is bolded in the .xlsx files. The next header is DG, which is also bolded. The values are right under the headers, so the first value for Trees is "36\" Box Southern Live Oak (1.5\" Caliper)" and the first value for DG is "Desert Gold". These values are not bolded.

It is important to stress that there are about 10-15 different headers across hundreds of files, and the number of values for each header can range from 1 to 100 rows. These headers and values are always in x[,1].

I am trying to figure out how to partition the sections (Trees, DG, ...) and read them into R as their own dataframes. I think the ideal way to do this is to read the files into a list and then separate the sections into their own dataframes inside a nested list.

Lastly, x[,5] contains headers named Quantity, which are also bolded, with unbolded integers under each Quantity header. x[,6] is the price for each of those quantities and x[,7] is those two columns multiplied together. I am trying to preserve these numbers as well.

In the end, I am trying to have multiple tables or dataframes in R that look like so:

df1

   Trees Quantity Price Totals
1 36"...       20    10    200
2 36"...       38    10    380

df2

           DG Quantity Price Totals
1 Desert Gold       26    10    260
2  Pink Coral       32    10    320

I am trying to create some way to efficiently do that over hundreds of .xlsx datasets.

So far, I have created a list that holds each of the Excel files. There are 248 files in a folder on my local PC. I read each file into a list like so:

library(readxl)

# List the workbooks once; the pattern skips anything in the folder that is not .xlsx
files <- list.files(".", pattern = "\\.xlsx$")
excel_list <- vector(mode = "list", length = length(files))

for (i in seq_along(files)) {
  excel_list[[i]] <- read_excel(files[i], col_names = FALSE)
}

CodePudding user response:

To achieve your desired result, you first have to identify the rows containing the section headers. Going by your example data, these are the rows that contain "Quantity" in the fifth column. After that, some additional data wrangling steps are needed to convert the data into a tidy format. Finally, the data can be split by section:

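library(janitor)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

tidy_data <- function(x) {
  x %>%
    # Drop rows and columns that are entirely NA
    remove_empty(which = c("rows", "cols")) %>%
    # A row is a section header if column 5 starts with "Quan"; its name sits in column 1
    mutate(is_header_row = grepl("^Quan", `...5`),
           section = ifelse(is_header_row, `...1`, NA_character_)) %>%
    # Carry each section name down to the item rows below it
    fill(section) %>%
    # Keep item rows only; this also drops the job info above the first header
    filter(!is.na(section), !is_header_row) %>%
    select(-is_header_row) %>%
    # The job-info columns are now entirely NA, so this second pass removes them
    remove_empty(which = c("rows", "cols")) %>%
    rename(Item = 1, Quantity = 2, Price = 3, Totals = 4)
}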

xx <- tidy_data(x)

xx
#> # A tibble: 4 × 5
#>   Item                                             Quantity Price Totals section
#>   <chr>                                            <chr>    <dbl>  <dbl> <chr>  
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)"     20          10    200 Trees  
#> 2 "36\" Box Thornless Chilean Mesquite  (1.5\" Ca… 38          10    380 Trees  
#> 3 "Desert Gold"                                    26          10    260 DG     
#> 4 "Pink Coral"                                     32          10    320 DG
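
Note that Quantity comes back as a character column, because the "Quantity" labels share a column with the numbers in the raw sheet. A minimal sketch of converting it is below. It uses a small hand-typed stand-in for xx (the name xx_demo and its values are assumptions copied from the output above) so that the snippet runs on its own:

```r
library(dplyr, warn.conflicts = FALSE)

# Hand-typed stand-in for two rows of the `xx` tibble shown above (hypothetical name)
xx_demo <- data.frame(
  Item     = c("Desert Gold", "Pink Coral"),
  Quantity = c("26", "32"),
  Price    = c(10, 10),
  Totals   = c(260, 320),
  section  = c("DG", "DG"),
  stringsAsFactors = FALSE
)

# as.numeric() turns the digit strings into doubles; any non-numeric
# text would become NA with a warning, which is worth checking for
xx_num <- xx_demo %>%
  mutate(Quantity = as.numeric(Quantity))

# Sanity check: Quantity * Price should reproduce the Totals column
all.equal(xx_num$Quantity * xx_num$Price, xx_num$Totals)
#> [1] TRUE
```

With the real data, the same mutate() call could be added as a last step inside tidy_data().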

xx %>%
  split(., .$section) %>%
  purrr::imap(function(x, y) { x %>% select(-section) %>% rename("{y}" := 1) })
#> $DG
#> # A tibble: 2 × 4
#>   DG          Quantity Price Totals
#>   <chr>       <chr>    <dbl>  <dbl>
#> 1 Desert Gold 26          10    260
#> 2 Pink Coral  32          10    320
#> 
#> $Trees
#> # A tibble: 2 × 4
#>   Trees                                                  Quantity Price Totals
#>   <chr>                                                  <chr>    <dbl>  <dbl>
#> 1 "36\" Box Southern Live Oak (1.5\" Caliper)"           20          10    200
#> 2 "36\" Box Thornless Chilean Mesquite  (1.5\" Caliper)" 38          10    380
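
To get the nested list over all files that the question asks for, the same idea can be mapped over the list of sheets with lapply(). The sketch below is a self-contained base-R variant, not the tidy_data() function above: split_sections(), make_sheet() and the file names are hypothetical, and the toy sheets only mimic the column layout from the question (names in column 1, "Quantity" headers in column 5, price in column 6, totals in column 7).

```r
# Toy stand-in for one sheet read with read_excel(col_names = FALSE)
make_sheet <- function(job) {
  data.frame(
    c1 = c(job, NA, "Trees", "Oak", "Mesquite", NA, "DG", "Desert Gold"),
    c2 = NA, c3 = NA, c4 = NA,
    c5 = c(NA, NA, "Quantity", "20", "38", NA, "Quantity", "26"),
    c6 = c(NA, NA, NA, 10, 10, NA, NA, 10),
    c7 = c(NA, NA, NA, 200, 380, NA, NA, 260),
    stringsAsFactors = FALSE
  )
}

# Split one raw sheet into a named list of data frames, one per section
split_sections <- function(sheet) {
  sheet <- as.data.frame(sheet)
  # Header rows are those whose fifth column starts with "Quan" (NA gives FALSE)
  is_header <- grepl("^Quan", sheet[[5]])
  # Running count of headers seen so far; 0 means "above the first section"
  section_id <- cumsum(is_header)
  # Keep item rows: inside a section, not a header, and with an item name
  keep <- section_id > 0 & !is_header & !is.na(sheet[[1]])
  section_names <- sheet[[1]][is_header]
  items <- data.frame(
    Item     = sheet[[1]][keep],
    Quantity = as.numeric(sheet[[5]][keep]),
    Price    = sheet[[6]][keep],
    Totals   = sheet[[7]][keep],
    stringsAsFactors = FALSE
  )
  labels <- factor(section_names[section_id[keep]], levels = unique(section_names))
  split(items, labels)
}

# Named list of sheets, keyed by (hypothetical) file name
excel_list <- list(make_sheet("Job A"), make_sheet("Job B"))
names(excel_list) <- c("job_a.xlsx", "job_b.xlsx")

# Nested list: one element per file, each holding one data frame per section
result <- lapply(excel_list, split_sections)
result[["job_a.xlsx"]]$Trees
```

Naming the list by file name keeps track of which job each set of tables came from. The tidy_data() and split() steps above could be applied across the real excel_list with lapply() in the same way.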