Home > other >  Can I write identical xlsx files from the same data frame in R?
Can I write identical xlsx files from the same data frame in R?

Time:02-01

Can I make sure that two XLSX files (written with openxlsx::write.xlsx) are identical, when given the same data to write? I think there's a timestamp written to the spreadsheet which means the same data written more than one second apart creates a different file.

For example, when written in rapid succession:

library(openxlsx)
write.xlsx(mtcars, "/tmp/t1.xlsx");write.xlsx(mtcars, "/tmp/t2.xlsx")

the files are identical:

$ md5sum /tmp/t?.xlsx
c9b5f6509e20dd62b158debfbef376fe  /tmp/t1.xlsx
c9b5f6509e20dd62b158debfbef376fe  /tmp/t2.xlsx

but if I sleep between writes:

unlink("/tmp/t1.xlsx") # remove previous
unlink("/tmp/t2.xlsx")
write.xlsx(mtcars, "/tmp/t1.xlsx");Sys.sleep(2);write.xlsx(mtcars, "/tmp/t2.xlsx")

its all different:

$ md5sum /tmp/t?.xlsx
460945a610df3bc8a1ccdae9eb86c1fa  /tmp/t1.xlsx
a4865be49994092173792c9f7354e435  /tmp/t2.xlsx

My use case is a process that generates an XLSX file which is going into a git repository. If I automate this, then the XLSX file is going to change every time, even if the source data hasn't. I suppose could test if the data has changed earlier in the process and not generate a new XLSX file, but it just seems easier to let git do the "has this changed?" test but clearly invisible metadata in the XLSX breaks that. Call me lazy.

Can the XLSX metadata be set to prevent this? I guess maybe there's a "Creation Date" in there somewhere. I don't care if this is 1970-01-01 all the time.

Pre-emptive defence: No, I can't use a CSV, the XLSX has multiple sheets and that's what my end-users want. Yes, I'm already also writing it to a SQlite database and that is identical when writing the same data to it.

I don't think this can be done with openxlsx as is, since the difference is due to the metadata XML created: https://github.com/ycphs/openxlsx/blob/7742063a4473879490d789c552bb8e6cc9a0d2c7/R/baseXML.R#L77 where it puts the current Sys.time() into the created field.

There appear to be two sources of difference. First, there's the Excel metadata written to the <dcterms:created> metadata within the MS Excel document structure. But even setting that the same (by monkey-patching openxlsx) still leaves a difference because the document is bundled using standard ZIP format and that also has datestamp headers.

Here's two XLSX files, unzipped, and this shows all the same CRC-32 values, so the files within are identical:

Archive:  test1.xlsx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     587  Defl:N      234  60% 2022-01-31 15:22 b5dbec60  _rels/.rels
    1402  Defl:N      362  74% 2022-01-31 15:22 63422601  [Content_Types].xml
     284  Defl:N      173  39% 2022-01-31 15:22 f9153db0  docProps/app.xml
     552  Defl:N      278  50% 2022-01-31 15:22 37126cbe  docProps/core.xml
     696  Defl:N      229  67% 2022-01-31 15:22 14a147d3  xl/_rels/workbook.xml.rels
    4500  Defl:N      311  93% 2022-01-31 15:22 285db1ad  xl/printerSettings/printerSettings1.bin
     601  Defl:N      203  66% 2022-01-31 15:22 211e1d6e  xl/sharedStrings.xml
    1127  Defl:N      464  59% 2022-01-31 15:22 0d8ee71d  xl/styles.xml
    7075  Defl:N     1361  81% 2022-01-31 15:22 050f988c  xl/theme/theme1.xml
     950  Defl:N      382  60% 2022-01-31 15:22 1b8cce29  xl/workbook.xml
     612  Defl:N      223  64% 2022-01-31 15:22 f0584777  xl/worksheets/_rels/sheet1.xml.rels
   12729  Defl:N     2204  83% 2022-01-31 15:22 18057777  xl/worksheets/sheet1.xml
--------          -------  ---                            -------
   31115             6424  79%                            12 files
$ unzip -v test2.xlsx
Archive:  test2.xlsx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     587  Defl:N      234  60% 2022-01-31 15:22 b5dbec60  _rels/.rels
    1402  Defl:N      362  74% 2022-01-31 15:22 63422601  [Content_Types].xml
     284  Defl:N      173  39% 2022-01-31 15:22 f9153db0  docProps/app.xml
     552  Defl:N      278  50% 2022-01-31 15:22 37126cbe  docProps/core.xml
     696  Defl:N      229  67% 2022-01-31 15:22 14a147d3  xl/_rels/workbook.xml.rels
    4500  Defl:N      311  93% 2022-01-31 15:22 285db1ad  xl/printerSettings/printerSettings1.bin
     601  Defl:N      203  66% 2022-01-31 15:22 211e1d6e  xl/sharedStrings.xml
    1127  Defl:N      464  59% 2022-01-31 15:22 0d8ee71d  xl/styles.xml
    7075  Defl:N     1361  81% 2022-01-31 15:22 050f988c  xl/theme/theme1.xml
     950  Defl:N      382  60% 2022-01-31 15:22 1b8cce29  xl/workbook.xml
     612  Defl:N      223  64% 2022-01-31 15:22 f0584777  xl/worksheets/_rels/sheet1.xml.rels
   12729  Defl:N     2204  83% 2022-01-31 15:22 18057777  xl/worksheets/sheet1.xml

but the files still differ:

$ md5sum test1.xlsx test2.xlsx 
27783e8b19631039a1c940db214f25e1  test1.xlsx
ba0678946aea1e01093ce25130b2c467  test2.xlsx

because of the ZIP metadata, visible with exiftool:

$ exiftool test*.xlsx | grep Zip | grep Date
Zip Modify Date                 : 2022:01:31 15:22:52
Zip Modify Date                 : 2022:01:31 15:22:54

CodePudding user response:

A possible workaround is to redefine genBaseCore function in openxlsx namespace using assignInNamespace.

In the example below, xlsx file is created one day before Sys.time():

library(openxlsx)

genBaseCore <- function(creator = "", title = NULL, subject = NULL, category = NULL) {
  
  replaceIllegalCharacters <- function(v){
    
    vEnc <- Encoding(v)
    v <- as.character(v)
    
    flg <- vEnc != "UTF-8"
    if(any(flg))
      v[flg] <- iconv(v[flg], from = "", to = "UTF-8")
    
    v <- gsub('&', "&amp;", v, fixed = TRUE)
    v <- gsub('"', "&quot;", v, fixed = TRUE)
    v <- gsub("'", "&apos;", v, fixed = TRUE)
    v <- gsub('<', "&lt;", v, fixed = TRUE)
    v <- gsub('>', "&gt;", v, fixed = TRUE)
    
    ## Escape sequences
    v <- gsub("\a", "", v, fixed = TRUE)
    v <- gsub("\b", "", v, fixed = TRUE)
    v <- gsub("\v", "", v, fixed = TRUE)
    v <- gsub("\f", "", v, fixed = TRUE)
    
    return(v)
  }
  
  core <- '<coreProperties xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
  
  core <- stringi:::stri_c(core, sprintf("<dc:creator>%s</dc:creator>", replaceIllegalCharacters(creator)))
  core <- stringi:::stri_c(core, sprintf("<cp:lastModifiedBy>%s</cp:lastModifiedBy>", replaceIllegalCharacters(creator)))

# Modify creation date here
  core <- stringi:::stri_c(core, sprintf('<dcterms:created xsi:type="dcterms:W3CDTF">%s</dcterms:created>', format(Sys.time()-86400, "%Y-%m-%dT%H:%M:%SZ")))
  
  if (!is.null(title)) {
    core <- stringi:::stri_c(core, sprintf("<dc:title>%s</dc:title>", replaceIllegalCharacters(title)))
  }
  
  if (!is.null(subject)) {
    core <- stringi:::stri_c(core, sprintf("<dc:subject>%s</dc:subject>", replaceIllegalCharacters(subject)))
  }
  
  if (!is.null(category)) {
    core <- stringi:::stri_c(core, sprintf("<cp:category>%s</cp:category>", replaceIllegalCharacters(category)))
  }
  
  core <- stringi:::stri_c(core, "</coreProperties>")
  
  return(core)
}


assignInNamespace("genBaseCore", genBaseCore, ns="openxlsx")

write.xlsx(mtcars, "test.xlsx")
<Created>2022-01-30T15:13:27Z</Created>

CodePudding user response:

You can try a simpler wrapper for comparing workbook objects (assuming you'd read in the previous workbook) and compare it to the current workbook.

library(openxlsx)
file1 <- temp_xlsx()
file2 <- temp_xlsx()
write.xlsx(mtcars, file1)
Sys.sleep(2)
write.xlsx(mtcars, file2)

wb1 <- loadWorkbook(file1)
wb2 <- loadWorkbook(file2)

all_equal_wb <- function(target, current) {
  exp <- "Workbook"
  attr(exp, "package") <- "openxlsx"
  stopifnot(identical(class(target), exp), identical(class(current), exp))
  target <- target$copy()
  current <- current$copy()
  target$core <- ""
  current$core <- ""
  # openxlsx::all.equal.Workbook
  all.equal(target, current)
}

all.equal(wb1, wb2)
#> [1] "Component \"core\": 1 string mismatch"
all_equal_wb(wb1, wb2)
#> [1] TRUE

Created on 2022-01-31 by the reprex package (v2.0.1)

openxlsx::all.equal.Workbook() doesn't have this sort of control (... is ignored) but it can be added. The package is still maintained so feel free to leave an issue: https://github.com/ycphs/openxlsx/issues

  •  Tags:  
  • Related