I am new to R. I am hoping to write a function that will scale all numeric columns in my data frame except for specific numeric columns (in the example below, I do not want to scale the column 'estimate'). Because of the particular context this function is being used in, I actually want to scale the data using another data frame. Below is an attempt that did not work. In this attempt original.df represents the data frame which needs to be scaled, and scaling.data represents the data used for scaling. I am trying to center the numeric original.df columns on the mean of the corresponding scaling.data columns, and divide by 2 standard deviations of scaling.data columns.
Additional information that may not be essential to a working solution:
This function will be nested in a larger function. In the larger function there is an argument called predictors, which represents the column names which need to be included in the new data frame, and are also found in the scaling data frame. This could be the vector used to iterate over for the scaling function, though this is not necessarily a requirement. (Note: This vector includes column names which reference columns that are both character and numeric, again I want the function to scale numeric columns only. The final product should include the unscaled 'estimate' column from original.df).
> predictors
[1] "color" "weight" "height" "length"
>original.df
color weight height length estimate
1 red 10 66 40 5
2 red 12 60 41 7
3 yellow 12 67 48 9
4 blue 15 55 36 10
5 yellow 21 54 48 7
6 red 12 54 43 5
7 red 11 38 36 6
>scale.data
color weight height length estimate
1 red 11 55 41 7
2 red 13 67 39 9
3 yellow 12 67 46 11
4 blue 16 8 37 5
5 yellow 23 10 47 9
6 red 17 11 41 10
7 red 16 13 37 13
scale2sd<-function(variable){
original.df[[variable]]<-((original.df[[variable]]) - mean(scaling.data[[variable]],na.rm=TRUE))/(2*sd(scaling.data[[variable]], na.rm=TRUE))
return(original.df[[variable]])
}
new.df<-original.df %>%mutate_at((!str_detect(names(.),"estimate")&is.numeric),scale)
I need the result to be the full new scaled data frame.
Thank you so much for your time and thoughts.
CodePudding user response:
We can do the following (I'm using dplyr
1.0.7 but anything >= 1.0.0 should work):
Create a function that scales
scale_to_sd <- function(other_df, target){
mean(other_df[,target], na.rm=TRUE) /
(2*sd(other_df[, target], na.rm=TRUE))
}
If you need only strict numeric
columns and need to exclude some columns, we can use matches
which provides more flexibility than contains
e.g.
df %>%
mutate(across(!matches("estimate|height") & where(is.numeric),
~ .x - scale_to_sd(scale_df,cur_column())))
The above will scale anything but estimate or height. One can play around with the RegEx.
color weight height length estimate
1 red 8.088421 66 34.87995 5
2 red 10.088421 60 35.87995 7
3 yellow 10.088421 67 42.87995 9
4 blue 13.088421 55 30.87995 10
5 yellow 19.088421 54 42.87995 7
6 red 10.088421 54 37.87995 5
7 red 9.088421 38 30.87995 6
ORIGINAL
df %>%
mutate(across(contains("estimate") & where(is.numeric),
~ .x - scale_to_sd(scale_df,cur_column())))
Apply function across target columns
df %>%
mutate(across(contains("estimate"),
~ .x - scale_to_sd(scale_df,cur_column())))
Result
color weight height length estimate
1 red 10 66 40 3.248164
2 red 12 60 41 5.248164
3 yellow 12 67 48 7.248164
4 blue 15 55 36 8.248164
5 yellow 21 54 48 5.248164
6 red 12 54 43 3.248164
7 red 11 38 36 4.248164
Data used:
df <- read.table(text="color weight height length estimate
1 red 10 66 40 5
2 red 12 60 41 7
3 yellow 12 67 48 9
4 blue 15 55 36 10
5 yellow 21 54 48 7
6 red 12 54 43 5
7 red 11 38 36 6", head=T)
scale_df <- read.table(text=" color weight height length estimate
1 red 11 55 41 7
2 red 13 67 39 9
3 yellow 12 67 46 11
4 blue 16 8 37 5
5 yellow 23 10 47 9
6 red 17 11 41 10
7 red 16 13 37 13", head=T)
CodePudding user response:
One way with base R. Comments in the code. Thanks, Nelson, for the data 1
df <- read.table(text="color weight height length estimate
1 red 10 66 40 5
2 red 12 60 41 7
3 yellow 12 67 48 9
4 blue 15 55 36 10
5 yellow 21 54 48 7
6 red 12 54 43 5
7 red 11 38 36 6", head=T)
scale_df <- read.table(text=" color weight height length estimate
1 red 11 55 41 7
2 red 13 67 39 9
3 yellow 12 67 46 11
4 blue 16 8 37 5
5 yellow 23 10 47 9
6 red 17 11 41 10
7 red 16 13 37 13", head=T)
## add reference and scaling df as arguments
scale2sd <- function(ref, scale_by, variable) {
((ref[[variable]]) - mean(scale_by[[variable]], na.rm = TRUE)) / (2 * sd(scale_by[[variable]], na.rm = TRUE))
}
predictors <- c("color", "weight", "height", "length")
## this is to get all numeric columns that are part of your predictor variables
df_to_scale <- Filter(is.numeric, df[predictors])
## create a named vector. This is a bit awkward but it makes it easier to select
## the corresponding items in the two data frames,
## and then replace the original columns
num_vars <- setNames(names(df_to_scale), names(df_to_scale))
## this is the actual scaling job -
## use the named vector for looping over the selected columns
## then assign it back to the selected columns
df[num_vars] <- lapply(num_vars, function(x) scale2sd(df, scale_df, x))
df
#> color weight height length estimate
#> 1 red -0.67259271 0.58130793 -0.14222363 5
#> 2 red -0.42479540 0.47561558 -0.01777795 7
#> 3 yellow -0.42479540 0.59892332 0.85334176 9
#> 4 blue -0.05309942 0.38753862 -0.64000632 10
#> 5 yellow 0.69029252 0.36992323 0.85334176 7
#> 6 red -0.42479540 0.36992323 0.23111339 5
#> 7 red -0.54869405 0.08807696 -0.64000632 6