Even though this is related to "sapply - retain column names", I could not find the answer there.
I had a simple function to scale data between 0 and 1 that retained the column names:
scale <- function(x){apply(x, 2, function(y) ((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE)))}
Now I needed to add an if clause for the case where max(y) = min(y) and changed the function like so:
scale <- function(x){apply(x, 2, function(y) if(min(y, na.rm=TRUE)==max(y, na.rm=TRUE)) {0.5} else {((y)-min(y, na.rm=TRUE))/(max(y, na.rm=TRUE)-min(y, na.rm=TRUE))})}
Using these functions on an input data frame like so...
as.data.frame(scale(input[sapply(input,is.numeric)]))
produces different column names: the original function preserves them, while the new one replaces parentheses, hyphens, and other special characters with dots:
Example column name w/o the IF: INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)
Example column name w/ the IF: INL_Avg.S.B0.ETC.CDS.06C.PM_CD1_D_B0_SI_P0V_B.NM.
While I do realize these column names are not ideal, they are what I need to use, and I would appreciate a hint as to how to avoid this special character replacement (adding USE.NAMES=TRUE to the sapply won't help...).
Thanks, Mark
CodePudding user response:
The root of your issue is that you are using apply on a data frame. apply is built to work on matrices, so the first thing it does is convert your data frame to a matrix, which is unnecessary, and then the default data frame methods when you convert back "fix" the column names in a way you don't like. This most likely only shows up with the new version because a constant column now returns a single 0.5, so the results have unequal lengths, apply returns a list instead of a matrix, and as.data.frame() on a list checks names by default. You may be able to fix this by adding check.names = FALSE to your as.data.frame() call, but a better approach would use lapply on a data frame, apply on a matrix, and even have it work if we give it a vector input.
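To see the name mangling in isolation, here is a minimal sketch. The data frame df and the helper rescale are made up for illustration (they are not from the question), and the example assumes at least one column is constant, which is what triggers the 0.5 branch:

```r
# Hypothetical data: one syntactically invalid column name, one constant column.
df <- data.frame(`a(1-2)` = c(1, 2, 3), b = c(5, 5, 5), check.names = FALSE)

# Same logic as the question's second version, written out for readability.
rescale <- function(y) {
  rng <- range(y, na.rm = TRUE)
  if (rng[1] == rng[2]) 0.5 else (y - rng[1]) / (rng[2] - rng[1])
}

res <- apply(df, 2, rescale)

is.list(res)
# [1] TRUE   (results have lengths 3 and 1, so apply cannot build a matrix)

names(as.data.frame(res))
# [1] "a.1.2." "b"

names(as.data.frame(res, check.names = FALSE))
# [1] "a(1-2)" "b"
```

Note that even with check.names = FALSE the constant column only comes out right because data.frame() recycles the single 0.5 to the full length.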
I'd also strongly recommend not overwriting the built-in scale function with a similar-but-different function. That could easily cause bugs. I've rewritten your function, calling it scale01() to make the distinction clear.
I also modified it so that if the input is a constant vector with missing values, only the non-missing values will be filled in with 0.5, which seems safer.
I use S3 dispatch to work appropriately based on the input class, built on a numeric method (scale01.numeric) that works on numeric vectors. Here it is, demonstrated on vector, data.frame, and matrix inputs:
## defining the functions
scale01 = function(x, ...) {
  UseMethod("scale01")
}
scale01.numeric = function(x, ...) {
  minx = min(x, na.rm = TRUE)
  maxx = max(x, na.rm = TRUE)
  if(minx == maxx) {
    x[!is.na(x)] = 0.5
    return(x)
  }
  (x - minx) / (maxx - minx)
}
scale01.data.frame = function(x, ...) {
  x[] = lapply(x, scale01)
  x
}
scale01.matrix = function(x, ...) {
  apply(x, MARGIN = 2, FUN = scale01)
}
## demonstrating usage
scale01(rnorm(5))
# [1] 0.0000000 1.0000000 0.4198958 0.6104154 0.2108150
scale01(mtcars[1:5, ])
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000 0 1 1 1.0000000
# Mazda RX4 Wag 0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195 0 1 1 1.0000000
# Datsun 710 1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765 1 1 1 0.0000000
# Hornet 4 Drive 0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000 1 0 0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195 0 0 0 0.3333333
scale01(as.matrix(mtcars[1:5, ]))
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.2678571 0.0000000 0 1 1 1.0000000
# Mazda RX4 Wag 0.5609756 0.5 0.2063492 0.2073171 1.00000000 0.4955357 0.1879195 0 1 1 1.0000000
# Datsun 710 1.0000000 0.0 0.0000000 0.0000000 0.93902439 0.0000000 0.7214765 1 1 1 0.0000000
# Hornet 4 Drive 0.6585366 0.5 0.5952381 0.2073171 0.00000000 0.7991071 1.0000000 1 0 0 0.0000000
# Hornet Sportabout 0.0000000 1.0 1.0000000 1.0000000 0.08536585 1.0000000 0.1879195 0 0 0 0.3333333
weird_name_df = data.frame(`weird column` = rnorm(5), `INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)` = rnorm(5), check.names = FALSE)
scale01(weird_name_df)
# weird column INL_Avg(S-B0-ETC-CDS-06C~PM_CD1_D_B0_SI_P0V_B.NM)
# 1 0.6135744 0.2237905
# 2 0.0000000 0.4086837
# 3 1.0000000 1.0000000
# 4 0.7061441 0.2803262
# 5 0.7693184 0.0000000
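The constant-vector behaviour mentioned earlier can be checked the same way. This is a small sketch using a standalone copy of the numeric method under the hypothetical name scale01_vec, so it runs on its own:

```r
# Standalone stand-in for scale01.numeric (the name scale01_vec is hypothetical).
scale01_vec <- function(x) {
  minx <- min(x, na.rm = TRUE)
  maxx <- max(x, na.rm = TRUE)
  if (minx == maxx) {
    x[!is.na(x)] <- 0.5   # constant input: fill only the observed values
    return(x)
  }
  (x - minx) / (maxx - minx)
}

scale01_vec(c(3, NA, 3))
# [1] 0.5  NA 0.5

scale01_vec(c(1, NA, 3))
# [1]  0 NA  1
```

Because the result always has the same length as the input, applying it column by column never produces unequal lengths.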
If you want to transform all the numeric columns of a data frame, I would suggest:
## base version
numeric_cols = sapply(your_data, is.numeric)
your_data[numeric_cols] = scale01(your_data[numeric_cols])
## dplyr version
library(dplyr)
your_data %>%
  mutate(across(where(is.numeric), scale01))
CodePudding user response:
Found the solution here:
as.data.frame(scale(input[sapply(input,is.numeric)]),check.names = FALSE)