How to format data from excel containing two rows of column headers to be able to use in R?-CodePudding

I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.

Another obstacle I have is the merged cells. I have been using fillMergedCells=TRUE to duplicate this.

 read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)

CodePudding user response：

One option would be to

read your excel file with option colNames = FALSE
Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator which makes it easy to split the names later on.
Get rid of the first two rows
Use tidyr::pivot_longer to convert to long format.

# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)

# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]

library(tidyr)
library(dplyr)

df %>%
  pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
  arrange(Time)
#> # A tibble: 16 × 4
#>    category Time   Y     Z    
#>    <chr>    <chr>  <chr> <chr>
#>  1 Total    Feb-21 1     1    
#>  2 A        Feb-21 2     2    
#>  3 B        Feb-21 3     3    
#>  4 C        Feb-21 4     4    
#>  5 D        Feb-21 5     5    
#>  6 E        Feb-21 6     6    
#>  7 F        Feb-21 7     7    
#>  8 G        Feb-21 8     8    
#>  9 Total    Jan-21 1     1    
#> 10 A        Jan-21 2     2    
#> 11 B        Jan-21 3     3    
#> 12 C        Jan-21 4     4    
#> 13 D        Jan-21 5     5    
#> 14 E        Jan-21 6     6    
#> 15 F        Jan-21 7     7    
#> 16 G        Jan-21 8     8

DATA

df <- structure(list(X1 = c(
  NA, NA, "Total", "A", "B", "C", "D", "E",
  "F", "G"
), X2 = c(
  "Year Rolling Avg.", "Share", NA, "1", "1",
  "1", "1", "1", "1", "1"
), X3 = c(
  "Year Rolling Avg.", "Y", "1",
  "2", "3", "4", "5", "6", "7", "8"
), X4 = c(
  "Year Rolling Avg.",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
  "Jan-21",
  "Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
  "Jan-21",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
  "Feb-21",
  "Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
  "Feb-21",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
  NA,
  10L
), class = "data.frame")