How can I get different datasets, all in one excel sheet-CodePudding

I have 5 databases (DEMOGRAPHIE, CARACTERISTIQUE, FIN, EI ,EIG) all on the same excel sheet. I will have to retrieve each database without having to copy and paste.

This is how the different databases look on the Excel sheet.

I think there are 3 empty rows between each database (but I don't know if this will always be the case).

And there is a row before each dataset corresponding to the title of the dataset. There is a general pattern that indicates where one dataset ends and another begins, but I have no idea how to retrieve each database while keeping the column names in the header

And that is how the datasets appear after importing.

I am wondering if it is more easy to select each after importing or during importing in R

CodePudding user response：

If there are always 3 blank rows separating the tables and they always start with the table name and then column names, the following should work.

Here we first read in the raw data and then parse it to find the positions of each table (using the 3 blank row rule). This is done by finding a match for a string of 3 NA values which is how read_excel() reads blank rows.

For the width, this solution takes advantage of the fact that if you specify the start and end rows, read_excel() can guess the width. Therefore you don't have to figure that out as long as there's nothing else in the table that would confuse it.

For the purposes of demonstration, I made up a minimal sample file with just the first 3 lines of mtcars and iris but it should scale to any arbitrary size.

library(tidyverse)
library(readxl)
library(writexl) # for creating test input

# sample data
d <- structure(list(...1 = c("MTCARS", "mpg", "21", "21", "22.8", NA, NA, NA, "IRIS", "Sepal.Length", "5.0999999999999996", "4.9000000000000004", "4.7"), ...2 = c(NA, "cyl", "6", "6", "4", NA, NA, NA, NA, "Sepal.Width", "3.5", "3", "3.2"), ...3 = c(NA, "disp", "160", "160", "108", NA, NA, NA, NA, "Petal.Length", "1.4", "1.4", "1.3"), ...4 = c(NA, "hp", "110", "110", "93", NA, NA, NA, NA, "Petal.Width", "0.2", "0.2", "0.2"), ...5 = c(NA, "drat", "3.9", "3.9", "3.85", NA, NA, NA, NA, "Species", "setosa", "setosa", "setosa"), ...6 = c(NA, "wt", "2.62", "2.875", "2.3199999999999998", NA, NA, NA, NA, NA, NA, NA, NA), ...7 = c(NA, "qsec", "16.46", "17.02", "18.61", NA, NA, NA, NA, NA, NA, NA, NA), ...8 = c(NA, "vs", "0", "0", "1", NA, NA, NA, NA, NA, NA, NA, NA), ...9 = c(NA, "am", "1", "1", "1", NA, NA, NA, NA, NA, NA, NA, NA), ...10 = c(NA, "gear", "4", "4", "4", NA, NA, NA, NA, NA, NA, NA, NA), ...11 = c(NA, "carb", "4", "4", "1", NA, NA, NA, NA, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
d %>% write_xlsx("test.xlsx", format_headers = F, col_names = F)

# read in raw data
raw_input <- read_excel("test.xlsx", col_names = F)

# take first column to find each table
first_col <- raw_input %>% pull(1)

# find positions for each table
table_start_rows <- c(1, which(first_col %in% c(NA,NA,NA))[c(T,F,F)]   3)   1
table_end_rows <- c(which(first_col %in% c(NA,NA,NA))[c(T,F,F)] - 1, length(first_col))
table_lengths <- table_end_rows - table_start_rows
# extract table names
table_names <- first_col[table_start_rows - 1]

# read in each table and capture into list
l <- map2(.x = (table_start_rows - 1),
     .y = table_lengths,
     .f = ~read_excel("test.xlsx", skip = .x, n_max = .y))

# add table names
names(l) <- table_names

# inspect output
l
#> $MTCARS
#> # A tibble: 3 × 11
#>   mpg   cyl   disp  hp    drat  wt                 qsec  vs    am    gear  carb 
#>   <chr> <chr> <chr> <chr> <chr> <chr>              <chr> <chr> <chr> <chr> <chr>
#> 1 21    6     160   110   3.9   2.62               16.46 0     1     4     4    
#> 2 21    6     160   110   3.9   2.875              17.02 0     1     4     4    
#> 3 22.8  4     108   93    3.85  2.3199999999999998 18.61 1     1     4     1    
#> 
#> $IRIS
#> # A tibble: 3 × 5
#>   Sepal.Length       Sepal.Width Petal.Length Petal.Width Species
#>   <chr>              <chr>       <chr>        <chr>       <chr>  
#> 1 5.0999999999999996 3.5         1.4          0.2         setosa 
#> 2 4.9000000000000004 3           1.4          0.2         setosa 
#> 3 4.7                3.2         1.3          0.2         setosa

^{Created on 2022-11-17 with reprex v2.0.2}