Rescale (0-1; min/max) all columns in a dataframe separately (by column, not the whole dataframe)

I have a dataframe with 300 columns (labeled like so X17.01, X24.05, X200.4...) and 500 rows. I want to rescale those columns to be between 0 and 1 but based on the min/max of each individual column. So, for example, I want to rescale column X17.01 separately from X24.05.

I have tried the following two pieces of R code (below), but both rescale the whole data frame.

Code 1:

Data_profile_standardized <- data.frame(lapply(Data_profile, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))

Code 2:

normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
dfout <- data.frame(normalize(Data_profile_standardize ))

CodePudding user response：

Let's make a nice small test case so we can see what's going on easily:

df = data.frame(X1 = 1:3, X2 = c(100, 150, 1000))

The problem with scale is not that it is applied to the whole data frame (lapply already passes it one column at a time). Rather, with center = FALSE all it does is divide each column by its maximum, so the minimum is never shifted to 0 and you don't get any 0s:

data.frame(lapply(df, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))
#          X1   X2
# 1 0.3333333 0.10
# 2 0.6666667 0.15
# 3 1.0000000 1.00

The problem with your normalize function is that it is already vectorized as written, so the Vectorize is not necessary. Worse, wrapping it in Vectorize makes it try to normalize each individual entry rather than each column. Since the diff(range()) of a single number is 0, you end up computing 0/0 and getting NaN as a result:

normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
data.frame(lapply(df, normalize))
#    X1  X2
# 1 NaN NaN
# 2 NaN NaN
# 3 NaN NaN
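To see where the NaN values come from, here is a minimal sketch of what happens to a single entry. The name per_element is made up for this illustration; it just shows that a function wrapped in Vectorize is called once per element, so it only ever sees vectors of length 1.

```r
# A single value has zero range, so min/max scaling turns into 0 / 0.
single_range <- diff(range(5))            # 0
single_scaled <- (5 - min(5)) / single_range
single_scaled                             # NaN

# Vectorize() calls the wrapped function once per element (via mapply),
# so inside the function `v` always has length 1.
per_element <- Vectorize(function(v) length(v))
per_element(c(10, 20, 30))                # 1 1 1
```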

Let's leave off the Vectorize (and add na.rm = TRUE for good measure, in case there are NA values in your actual data):

normalize = function(v) (v - min(v, na.rm = TRUE)) / diff(range(v, na.rm = TRUE))

data.frame(lapply(df, normalize))
#    X1         X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000

This works!
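With 300 columns you probably won't want to check the result by eye. One possible sanity check, sketched here on the small test case (the names scaled and col_ranges are just for this illustration), is to compute the range of every rescaled column and confirm each one runs from 0 to 1:

```r
df <- data.frame(X1 = 1:3, X2 = c(100, 150, 1000))
normalize <- function(v) (v - min(v, na.rm = TRUE)) / diff(range(v, na.rm = TRUE))
scaled <- data.frame(lapply(df, normalize))

# One column per input column: row 1 holds the minima, row 2 the maxima.
col_ranges <- sapply(scaled, range, na.rm = TRUE)
all(col_ranges[1, ] == 0)   # TRUE
all(col_ranges[2, ] == 1)   # TRUE

# Caveat: a constant column has zero range, so it comes out as NaN (0 / 0).
constant_col <- normalize(c(7, 7, 7))
constant_col                # NaN NaN NaN
```

If your real data might contain constant columns, you would need to decide separately how to handle them (for example, dropping them or setting them to 0); the formula above does not make that choice for you.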

Note that we could work more with the scale function. If you specify center = min(x), then the minimums will be subtracted and you'll get 0s... but then max(x) is no longer the correct scale factor. We need to use diff(range()) here, just like in the other methods:

# also works
data.frame(lapply(df, function(x) scale(
  x, 
  center = min(x, na.rm = TRUE),
  scale = diff(range(x, na.rm = TRUE))
)))
#    X1         X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000

CodePudding user response：

Make a copy of the data.frame and then normalize it by lapplying the normalizing function to the data set.

Note the square brackets: without them, dfout would be overwritten by the plain list that lapply returns and would lose its tabular shape.

normalize <- function(v, na.rm = FALSE) (v - min(v, na.rm = na.rm))/diff(range(v, na.rm = na.rm))

dfout <- Data_profile
dfout[] <- lapply(dfout, normalize)

With the data in Gregor Thomas's answer, the result is the following.

dfout
#   X1         X2
#1 0.0 0.00000000
#2 0.5 0.05555556
#3 1.0 1.00000000
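To illustrate the point about the square brackets, here is a small sketch. The Data_profile below is a made-up stand-in for the real data (two columns, one with an NA), not the asker's actual data frame. It also passes na.rm = TRUE through lapply, because with this function's default of na.rm = FALSE any column containing an NA would come back entirely NA.

```r
# Hypothetical stand-in for the real 300-column data.
Data_profile <- data.frame(X17.01 = c(2, 4, NA, 10),
                           X24.05 = c(100, 150, 1000, 550))

normalize <- function(v, na.rm = FALSE) {
  (v - min(v, na.rm = na.rm)) / diff(range(v, na.rm = na.rm))
}

# Without the brackets the result is a plain list, not a data frame.
as_list <- lapply(Data_profile, normalize, na.rm = TRUE)
class(as_list)    # "list"

# With the brackets the values are replaced but the data frame is kept.
dfout <- Data_profile
dfout[] <- lapply(dfout, normalize, na.rm = TRUE)
class(dfout)      # "data.frame"

# With the default na.rm = FALSE, the column with an NA becomes all NA.
all_na <- normalize(Data_profile$X17.01)
all_na            # NA NA NA NA
```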