Should I subset my data before normalizing in R?

I am very new to working with data in this fashion. I have the following dataset:

   var     y
1    r 1.072
2    r 1.067
3    r 1.072
4    r 1.068
5    r 1.072
6    r 1.068
7    r 1.075
8    r 1.076
9    r 1.072
10   r 1.073
11   r 1.077
12   r 1.073
13   r 1.069
14   r 1.072
15   s 1.012
16   s 1.004
17   s 1.011
18   s 1.003
19   s 1.008
20   s 1.003
21   s 1.006
22   s 1.007
23   s 1.004
24   s 1.005
25   s 1.005
26   s 1.006
27   s 1.002
28   s 1.005
29   s 1.003

I am making a predictive model with it, but because r and s have different standards for their y value, I was told to normalize them. I have created the following function to normalize:

min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

To do this correctly, should I be normalizing all of the data at once, like this:

norm_data <- as.data.frame(lapply(data[2], min_max_norm))

Or should I normalize each of the var's separately, then put them together in the same dataset, like this:

library(dplyr)
dataR <- data %>% filter(var == "r")
dataS <- data %>% filter(var == "s")

norm_dataR <- as.data.frame(lapply(dataR[2], min_max_norm))
norm_dataS <- as.data.frame(lapply(dataS[2], min_max_norm))

In addition, if the second method is correct, is there a way to join the two datasets back together in R?

Thanks in advance for any help. I am fairly new, so please let me know if I have done anything incorrectly in this post and I will attend to that.

CodePudding user response：

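Since "r" and "s" are measured on different scales, I'd recommend separating them and then applying the function you created to each subset. If you normalize before separating, the minimum (1.002) comes from "s" and the maximum (1.077) comes from "r", so every "r" value lands between roughly 0.87 and 1 and every "s" value between 0 and roughly 0.13. The normalized column would then mostly reflect which group a row belongs to rather than the variation within each group.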
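
As a rough illustration of the difference, here is a minimal sketch that uses only six of the values from the question (each group's minimum, maximum, and one value in between). The names `pooled` and `grouped` are made up for this example.

```r
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# A few values from the question, including each group's min and max
dat <- data.frame(
  var = c("r", "r", "r", "s", "s", "s"),
  y   = c(1.067, 1.072, 1.077, 1.002, 1.005, 1.012)
)

# Pooled: one min and one max are taken across both groups
pooled <- min_max_norm(dat$y)

# Per group: each group is scaled by its own min and max
grouped <- ave(dat$y, dat$var, FUN = min_max_norm)

print(data.frame(dat, pooled = round(pooled, 3), grouped = round(grouped, 3)))
```

In the printed table the pooled column keeps the two groups in separate bands, while the grouped column spreads each group over the full 0 to 1 range.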

Since you already know how to split the data, you can normalize each subset and then stack the subsets back together with rbind().
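
A minimal sketch of that split, normalize, and rbind() workflow might look like the following. It assumes a small stand-in data frame with the same two columns as the question, uses base R indexing instead of filter() so that it runs without extra packages, and stores the result in a new column called `y_norm` (a name chosen here for illustration).

```r
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Small stand-in for the question's data (same columns, fewer rows)
dat <- data.frame(
  var = c("r", "r", "r", "s", "s", "s"),
  y   = c(1.067, 1.072, 1.077, 1.002, 1.005, 1.012)
)

# Split by group; dat %>% filter(var == "r") would give the same rows
dataR <- dat[dat$var == "r", ]
dataS <- dat[dat$var == "s", ]

# Normalize y within each subset, keeping var and the original y
dataR$y_norm <- min_max_norm(dataR$y)
dataS$y_norm <- min_max_norm(dataS$y)

# Stack the two subsets back into a single data frame
norm_data <- rbind(dataR, dataS)
print(norm_data)
```

Adding a column rather than replacing the whole data frame keeps `var` attached to each row, which rbind() needs in order to produce a combined dataset that still identifies the groups.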

CodePudding user response：

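# ave() applies min_max_norm within each level of var and returns the
# results in the original row order, so no split or re-join is needed.
dat$y <- with(dat, ave(y, var, FUN = min_max_norm))
dat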
#   var   y
#1    r 0.5
#2    r 0.0
#3    r 0.5
#4    r 0.1
#5    r 0.5
#6    r 0.1
#7    r 0.8
#8    r 0.9
#9    r 0.5
#10   r 0.6
#11   r 1.0
#12   r 0.6
#13   r 0.2
#14   r 0.5
#15   s 1.0
#16   s 0.2
#17   s 0.9
#18   s 0.1
#19   s 0.6
#20   s 0.1
#21   s 0.4
#22   s 0.5
#23   s 0.2
#24   s 0.3
#25   s 0.3
#26   s 0.4
#27   s 0.0
#28   s 0.3
#29   s 0.1

# Data used in the example above (dput output)
dat <- structure(list(var = c("r", "r", "r", "r", "r", "r", "r", "r", 
"r", "r", "r", "r", "r", "r", "s", "s", "s", "s", "s", "s", "s", 
"s", "s", "s", "s", "s", "s", "s", "s"), y = c(1.072, 1.067, 
1.072, 1.068, 1.072, 1.068, 1.075, 1.076, 1.072, 1.073, 1.077, 
1.073, 1.069, 1.072, 1.012, 1.004, 1.011, 1.003, 1.008, 1.003, 
1.006, 1.007, 1.004, 1.005, 1.005, 1.006, 1.002, 1.005, 1.003
)), row.names = c(NA, -29L), class = "data.frame")
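
Since the question already loads dplyr, the same per-group normalization can probably also be written as a grouped mutate. This is only a sketch on a small stand-in data frame, not part of the answer above, and `dat_norm` is a name chosen for illustration.

```r
library(dplyr)

min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Small stand-in for the question's data (same columns, fewer rows)
dat <- data.frame(
  var = c("r", "r", "r", "s", "s", "s"),
  y   = c(1.067, 1.072, 1.077, 1.002, 1.005, 1.012)
)

# group_by() makes mutate() evaluate min_max_norm separately per var
dat_norm <- dat %>%
  group_by(var) %>%
  mutate(y = min_max_norm(y)) %>%
  ungroup()

print(dat_norm)
```

Like ave(), this keeps the rows in their original order and avoids having to split and re-join the data by hand.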