Understanding column handling function in R

I'm trying to understand what this R function does:

legacy_repair <- function(nms, prefix = "X", sep = "__") {
        if (length(nms) == 0)
                return(character()) # Returns a blank character variable?
        blank <- nms == "" # What does this do? Put quotations "" around the column name?
        nms[!blank] <- make.unique(nms[!blank], sep = sep)
        new_nms <-
                setdiff(paste(prefix, seq_along(nms), sep = sep), nms)
        nms[blank] <- new_nms[seq_len(sum(blank))]
        nms
}

file_names <- list.files("Data", pattern = "*.xlsx", full.names = TRUE) # This I understand

DF_list <- lapply(file_names, function(x)
                read_xlsx(x, .name_repair = legacy_repair))

My understanding so far is that the legacy_repair() function determines how to handle column names when reading the xlsx file.

When the file name in folder "Data" is blank, legacy_repair() does "return(character())"?

When the file name in folder "Data" is not blank, legacy_repair():

uses make.unique() to distinguish between columns with the same name and add "__" to the end
uses setdiff() to do ?

I hope someone can help me make sense of this!

CodePudding user response:

Here's a heavily commented version.

legacy_repair <- function(nms, prefix = "X", sep = "__") {
        ## If the input `nms` has length 0 (there are no column names at all),
        ## return a `character` vector that also has length 0.
        ##   Basically: nothing to repair, so return an empty result.
        if (length(nms) == 0)
                return(character())

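        blank <- nms == ""
        ## `==` is a comparison, not an assignment. `nms == ""` tests each
        ## element of `nms` and gives TRUE where the name is blank, FALSE
        ## otherwise. `""` is a blank: a single string with nothing in it.
        ## Nothing gets quoted here; `blank` is just a TRUE/FALSE vector.

        nms[!blank] <- make.unique(nms[!blank], sep = sep)
        ## Make sure all the non-blank `nms` are unique. Repeated names get
        ## `sep` and a counter appended: "a", "a" becomes "a", "a__1".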

        new_nms <-
                setdiff(paste(prefix, seq_along(nms), sep = sep), nms)
        ## The `paste()` constructs one candidate name per column from the
        ## prefix and sep; by default, "X__1", "X__2", "X__3", ...
        ## Then `setdiff()` keeps only the candidates that are not already
        ## in `nms`, so a replacement can never collide with an existing
        ## name. Call the result `new_nms`.

        nms[blank] <- new_nms[seq_len(sum(blank))]
        ## `sum(blank)` counts the blank names. Take that many names from
        ## `new_nms` and assign them to the blank positions, in order.

        nms
        ## Return the result
}

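file_names <- list.files("Data", pattern = "\\.xlsx$", full.names = TRUE)
## List the .xlsx files in the "Data" folder, with their full paths.
## `pattern` is a regular expression, not a glob, so "\\.xlsx$" is the
## stricter way to say "file name ends in .xlsx".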

DF_list <- lapply(file_names, function(x)
                read_xlsx(x, .name_repair = legacy_repair))
## Read in all the `file_names` files. `read_xlsx()` comes from the
## readxl package, so `library(readxl)` must be loaded first.
## Use the `legacy_repair` function above to fix up the column names
##   - make sure the column names are unique
##   - replace any missing column names with X__1, X__2, X__3, etc.
##     (making sure those are unique too)
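
To see those steps on concrete input, here is a small sketch that calls the function directly, without reading any file. The sample name vectors are made up for illustration; the function body is repeated only so the snippet runs on its own.

```r
legacy_repair <- function(nms, prefix = "X", sep = "__") {
  if (length(nms) == 0) return(character())
  blank <- nms == ""
  nms[!blank] <- make.unique(nms[!blank], sep = sep)
  new_nms <- setdiff(paste(prefix, seq_along(nms), sep = sep), nms)
  nms[blank] <- new_nms[seq_len(sum(blank))]
  nms
}

# Hypothetical header row: "a" appears twice and two cells are blank
repaired <- legacy_repair(c("a", "", "a", "", "b"))
print(repaired)
# "a" "X__1" "a__1" "X__2" "b"

# A column is already called "X__1", so setdiff() drops that candidate
# and the blank column gets the next free name instead
clash <- legacy_repair(c("X__1", ""))
print(clash)
# "X__1" "X__2"

# No names at all: the early return hands back a length-0 character vector
empty <- legacy_repair(character())
print(length(empty))
# 0
```

Note that the blank columns are filled from the start of the unused candidates, so the blanks at positions 2 and 4 become "X__1" and "X__2" rather than "X__2" and "X__4".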

One subtle point, which may not matter for your purposes: the difference between character() and "" can be tricky at first. Both are objects of the same class (character, often colloquially called "string"). The difference is their length. character() has length 0: it is an empty vector that contains no strings at all. "" has length 1. It could also be written c(""), a vector containing a single blank string.

c("a", "", "c", "") is a character vector of length 4 whose 2nd and 4th elements are blanks. c("a", character(), "c", character()) is a character vector of length 2: character() isn't a blank string, it is no string at all, so it contributes nothing to the result.
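
A quick way to check this at the console is sketched below. The variable names are only for illustration.

```r
len_empty <- length(character())   # a vector with no elements
len_blank <- length("")            # a vector with one (blank) element
print(c(len_empty, len_blank))
# 0 1

with_blanks <- c("a", "", "c", "")
print(length(with_blanks))
# 4
print(which(with_blanks == ""))    # positions of the blank strings
# 2 4

# character() adds no elements when combined with c()
without_blanks <- c("a", character(), "c", character())
print(without_blanks)
# "a" "c"
```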

In this function, the length-0 case comes up when one of the files is blank: it has no rows and no columns, so there are no column names to repair. If you tried to set the column names to "", there would be an error, because there is no column to give a blank name to. The vector of column names must have class character, and its length must match the number of columns. So if there are 0 columns, setting the column names to the 0-length character() works.
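
The length-matching rule can be tried on an empty data frame. This is a minimal sketch using base R's data.frame() as a stand-in for a blank sheet, which is an assumption made for illustration (read_xlsx() actually returns a tibble). The wording of the error message may differ between R versions, so it is captured rather than quoted.

```r
empty_df <- data.frame()           # stand-in for a blank sheet
print(ncol(empty_df))
# 0

# 0 columns and 0 names: the lengths match, so this is accepted
names(empty_df) <- character()

# 0 columns but 1 name (a single blank string): the lengths do not match
result <- tryCatch({
  names(empty_df) <- ""
  "no error"
}, error = function(e) conditionMessage(e))
print(result)                      # prints the captured error message
```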