Home > Net >  How to format data from excel containing two rows of column headers to be able to use in R?
How to format data from excel containing two rows of column headers to be able to use in R?

Time:12-30

I am importing the following table 1 into R but am struggling with the formatting, as each column has two headers. My desired output is the second table 2. I plan to use tidyr to gather the data.

Another obstacle I have is the merged cells. I have been using fillMergedCells=TRUE to duplicate this.

 read.xlsx(xlsxFile ="C:/Users/X/X/Desktop/X.xlsx",fillMergedCells = TRUE)

CodePudding user response:

One option would be to

  1. read your excel file with option colNames = FALSE
  2. Paste the first two rows together and use the result as the column names. Here I use an underscore as the separator which makes it easy to split the names later on.
  3. Get rid of the first two rows
  4. Use tidyr::pivot_longer to convert to long format.
# df <- openxlsx::read.xlsx(xlsxFile ="data/test2.xlsx", fillMergedCells = TRUE, colNames = FALSE)

# Use first two rows as names
names(df) <- paste(df[1, ], df[2, ], sep = "_")
names(df)[1] <- "category"
# Get rid of first two rows and columns containing year average
df <- df[-c(1:2), ]
df <- df[, !grepl("^Year", names(df))]

library(tidyr)
library(dplyr)

df %>%
  pivot_longer(-category, names_to = c("Time", ".value"), names_pattern = "^(.*?)_(.*)$") %>%
  arrange(Time)
#> # A tibble: 16 × 4
#>    category Time   Y     Z    
#>    <chr>    <chr>  <chr> <chr>
#>  1 Total    Feb-21 1     1    
#>  2 A        Feb-21 2     2    
#>  3 B        Feb-21 3     3    
#>  4 C        Feb-21 4     4    
#>  5 D        Feb-21 5     5    
#>  6 E        Feb-21 6     6    
#>  7 F        Feb-21 7     7    
#>  8 G        Feb-21 8     8    
#>  9 Total    Jan-21 1     1    
#> 10 A        Jan-21 2     2    
#> 11 B        Jan-21 3     3    
#> 12 C        Jan-21 4     4    
#> 13 D        Jan-21 5     5    
#> 14 E        Jan-21 6     6    
#> 15 F        Jan-21 7     7    
#> 16 G        Jan-21 8     8

DATA

df <- structure(list(X1 = c(
  NA, NA, "Total", "A", "B", "C", "D", "E",
  "F", "G"
), X2 = c(
  "Year Rolling Avg.", "Share", NA, "1", "1",
  "1", "1", "1", "1", "1"
), X3 = c(
  "Year Rolling Avg.", "Y", "1",
  "2", "3", "4", "5", "6", "7", "8"
), X4 = c(
  "Year Rolling Avg.",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X5 = c(
  "Jan-21",
  "Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X6 = c(
  "Jan-21",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
), X7 = c(
  "Feb-21",
  "Y", "1", "2", "3", "4", "5", "6", "7", "8"
), X8 = c(
  "Feb-21",
  "Z", "1", "2", "3", "4", "5", "6", "7", "8"
)), row.names = c(
  NA,
  10L
), class = "data.frame")
  • Related