Prevent change to dataframe format in R

Time:02-04

I have a dataframe that must have a specific layout. Is there a way for me to make R reject any command I attempt that would change the number or names of the columns?

It is easy to check the format of the data table manually, but I have found no way to make R do it for me automatically every time I execute a piece of code.

regards

CodePudding user response:

You mention that the number and names of the columns need to stay the same. Be aware that with data.table, names are also updated by reference, so a variable assigned from names(foo) changes along with the table. See the example below.

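library(data.table)

foo <- data.table(
  x = letters[1:5],
  y = LETTERS[1:5]
)

# No copy is made here: colnames points at the table's own names vector
colnames <- names(foo)

colnames
# [1] "x" "y"

# Both calls modify foo by reference
setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]

# The saved names changed too, so the comparison cannot detect the change
colnames
# [1] "a" "b" "z"

identical(colnames, names(foo))
# [1] TRUE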

To check that both the columns and their names are unaltered (and in the same order here), take a copy of the names right away. After each code run, you can then compare the current names with the copied names.

foo <- data.table(
  x = letters[1:5],
  y = LETTERS[1:5]
)

colnames <- copy(names(foo))

setnames(foo, colnames, c("a", "b"))
foo[, z := "oops"]

identical(colnames, names(foo))
# [1] FALSE

colnames
# [1] "x" "y"

names(foo)
# [1] "a" "b" "z"
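The snapshot-and-compare idea above could be wrapped in a small helper so the check is a single call after each step. This is only a sketch: `layout_guard` and `check_foo` are hypothetical names, not part of data.table or base R, and the example uses a plain data.frame so it runs without extra packages. For a data.table, the snapshot line would need `copy(names(df))` as shown above.

```r
# Hypothetical helper: remember a layout once and return a checker that
# errors when a data frame no longer has the same names in the same order.
layout_guard <- function(df) {
  expected <- names(df)  # assumption: plain data.frame; use copy(names(df)) for a data.table
  function(x) {
    current <- names(x)
    if (!identical(current, expected)) {
      stop(
        "Column layout changed. Expected: ", paste(expected, collapse = ", "),
        "; found: ", paste(current, collapse = ", "),
        call. = FALSE
      )
    }
    invisible(x)
  }
}

foo <- data.frame(x = letters[1:5], y = LETTERS[1:5])
check_foo <- layout_guard(foo)

check_foo(foo)  # layout unchanged, so this passes silently

foo$z <- "oops"
res <- try(check_foo(foo), silent = TRUE)  # layout changed, so this errors
```

Note that this only detects a change after it has happened; it does not stop the modifying command itself from running.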

CodePudding user response:

This doesn’t offer the level of foolproof safety I think you’re looking for (hard to know without more details), but you could define a function operator that yields modified functions that error if changes to columns are detected:

same_cols <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    stopifnot(identical(sort(names(.data)), sort(names(out))))
    out
  }
}

For example, you could create modified versions of dplyr functions:

library(dplyr)

my_mutate <- same_cols(mutate)
my_summarize <- same_cols(summarize)

which work as usual if columns are preserved:

mtcars %>%
  my_mutate(mpg = mpg / 2) %>%
  head()
#                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         10.50   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     10.50   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        11.40   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    10.70   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout  9.35   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant            9.05   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars %>%
  my_summarize(across(everything(), mean))
#        mpg    cyl     disp       hp     drat      wt     qsec     vs      am
# 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
#     gear   carb
# 1 3.6875 2.8125

But error if changes to columns are made:

mtcars %>%
  my_mutate(mpg2 = mpg / 2)

# Error in my_mutate(., mpg2 = mpg/2) : 
#   identical(sort(names(.data)), sort(names(out))) is not TRUE

mtcars %>%
  my_summarize(mpg = mean(mpg))

# Error in my_summarize(., mpg = mean(mpg)) : 
#   identical(sort(names(.data)), sort(names(out))) is not TRUE

You could edit the function operator to include more specific tests and informative error messages.
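As one possible take on that last suggestion, the operator could report which columns were added or removed instead of a bare `stopifnot` failure. This is a sketch under assumptions: `same_cols_verbose` and `my_transform` are hypothetical names, and base R's `transform()` stands in for the dplyr verbs so the example runs without extra packages.

```r
# Hypothetical variant of same_cols() that names the offending columns.
same_cols_verbose <- function(fn) {
  function(.data, ...) {
    out <- fn(.data, ...)
    added   <- setdiff(names(out), names(.data))
    removed <- setdiff(names(.data), names(out))
    problems <- c(
      if (length(added) > 0) paste("added:", paste(added, collapse = ", ")),
      if (length(removed) > 0) paste("removed:", paste(removed, collapse = ", "))
    )
    if (length(problems) > 0) {
      stop("Column set changed (", paste(problems, collapse = "; "), ")",
           call. = FALSE)
    }
    out
  }
}

# Wrap base R's transform() the same way mutate() was wrapped above
my_transform <- same_cols_verbose(transform)

ok  <- my_transform(mtcars, mpg = mpg / 2)                     # columns preserved
bad <- try(my_transform(mtcars, mpg2 = mpg / 2), silent = TRUE)  # adds mpg2, so it errors
```

Because it compares with `setdiff()`, this version ignores column order and would not notice duplicated names; stricter checks (order, column types) could be added in the same place.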
