I have a dataframe with 300 columns (labeled like so X17.01, X24.05, X200.4...) and 500 rows. I want to rescale those columns to be between 0 and 1 but based on the min/max of each individual column. So, for example, I want to rescale column X17.01 separately from X24.05.
I have tried the following two snippets in R (below), but both seem to rescale the whole data frame rather than each column on its own.
Code 1:
Data_profile_standardized <- data.frame(lapply(Data_profile, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))
Code 2:
normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
dfout <- data.frame(normalize(Data_profile_standardized))
CodePudding user response:
Let's make a nice small test case so we can see what's going on easily:
df = data.frame(X1 = 1:3, X2 = c(100, 150, 1000))
The problem with scale is not that it is applied to the whole data frame; rather, it is that with center = FALSE all it does is divide by the maximum, so you don't get any 0s:
data.frame(lapply(df, function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE)/1)))
# X1 X2
# 1 0.3333333 0.10
# 2 0.6666667 0.15
# 3 1.0000000 1.00
The problem with your normalize function is that the arithmetic inside it already works on a whole vector, so the Vectorize wrapper is not necessary. Wrapping it in Vectorize makes it try to normalize each individual entry, not each column, and since the diff(range()) of a single number is 0, you end up computing 0/0 and get NaN as a result:
normalize <- Vectorize(function(v) (v-min(v))/diff(range(v)))
data.frame(lapply(df, normalize))
# X1 X2
# 1 NaN NaN
# 2 NaN NaN
# 3 NaN NaN
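To see where the NaN comes from, it may help to run the same arithmetic on a single number, which is what the Vectorize-wrapped function receives on each call. This is only a small sketch; the name one_value is made up for illustration:

```r
# Vectorize() calls the inner function once per element, so v is a single number.
one_value <- 100

# The range of a single number is c(100, 100), so its width is 0.
width <- diff(range(one_value))

# Numerator and denominator are both 0, and 0 / 0 is NaN in R.
result <- (one_value - min(one_value)) / width
result
```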
Let's leave off the Vectorize (and add na.rm = TRUE for good measure, in case there are NA values in your actual data):
normalize = function(v) (v - min(v, na.rm = TRUE)) / diff(range(v, na.rm = TRUE))
data.frame(lapply(df, normalize))
# X1 X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000
This works!
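One way to check the result on a wider data frame is to look at the range of every rescaled column. The sketch below reuses the small test data; the names scaled and col_ranges are only illustrative. It also shows one caveat worth knowing about: a column whose values are all identical has a range of 0, so this formula returns NaN for it.

```r
df <- data.frame(X1 = 1:3, X2 = c(100, 150, 1000))
normalize <- function(v) (v - min(v, na.rm = TRUE)) / diff(range(v, na.rm = TRUE))
scaled <- data.frame(lapply(df, normalize))

# Each column should now have a minimum of 0 and a maximum of 1.
col_ranges <- sapply(scaled, range)
col_ranges

# Caveat: a constant column gives 0 / 0 for every entry, which is NaN.
constant_result <- normalize(c(5, 5, 5))
constant_result
```

If your real data may contain constant columns, you would need to decide how to handle them (for example, setting them to 0) before or after rescaling.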
Note that we could work more with the scale function. If you specify center = min(x), then the minimums will be subtracted and you'll get 0s... but then max(x) is no longer the correct scale factor. We need to use diff(range()) here, just like in the normalize function:
# also works
data.frame(lapply(df, function(x) scale(
x,
center = min(x, na.rm = TRUE),
scale = diff(range(x, na.rm = TRUE))
)))
# X1 X2
# 1 0.0 0.00000000
# 2 0.5 0.05555556
# 3 1.0 1.00000000
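One difference between the two approaches is worth keeping in mind: scale returns a one-column matrix carrying the centering and scaling values as attributes, not a plain vector. A minimal sketch on a single column (the names x and s are illustrative), in case you want a plain numeric vector back:

```r
x <- c(100, 150, 1000)
s <- scale(x, center = min(x), scale = diff(range(x)))

# scale() returns a matrix and records what it subtracted and divided by.
is.matrix(s)
attr(s, "scaled:center")
attr(s, "scaled:scale")

# as.vector() drops the matrix shape and the attributes.
plain <- as.vector(s)
plain
```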
CodePudding user response:
Make a copy of the data.frame and then normalize it by lapplying the normalizing function to the data set.
Note the square brackets in dfout[], without which dfout would become a plain list instead of keeping its tabular shape.
normalize <- function(v, na.rm = FALSE) (v - min(v, na.rm = na.rm))/diff(range(v, na.rm = na.rm))
dfout <- Data_profile
dfout[] <- lapply(dfout, normalize)
With the data in Gregor Thomas's answer, the result is the following.
dfout
# X1 X2
#1 0.0 0.00000000
#2 0.5 0.05555556
#3 1.0 1.00000000
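The effect of the square brackets, and how the na.rm argument of this normalize could be switched on, can be sketched as follows. Data_profile here is a small stand-in for the real data, and as_list is an illustrative name:

```r
normalize <- function(v, na.rm = FALSE) (v - min(v, na.rm = na.rm)) / diff(range(v, na.rm = na.rm))

# Stand-in for the real data, with one missing value added.
Data_profile <- data.frame(X1 = c(1, 2, NA, 3), X2 = c(100, 150, 500, 1000))

# Without the brackets, the result of lapply() is a plain list.
as_list <- lapply(Data_profile, normalize, na.rm = TRUE)
class(as_list)

# With the brackets, the columns are replaced in place and dfout stays a data.frame.
# Arguments after the function name are passed on to normalize by lapply().
dfout <- Data_profile
dfout[] <- lapply(dfout, normalize, na.rm = TRUE)
class(dfout)
dfout
```

With the default na.rm = FALSE, a column containing an NA would come back entirely NA, so passing na.rm = TRUE is likely what you want if your data has missing values.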