selective scaling function in r using a different data frame to scale-CodePudding

I am new to R. I am hoping to write a function that will scale all numeric columns in my data frame except for specific numeric columns (in the example below, I do not want to scale the column 'estimate'). Because of the particular context this function is being used in, I actually want to scale the data using another data frame. Below is an attempt that did not work. In this attempt original.df represents the data frame which needs to be scaled, and scaling.data represents the data used for scaling. I am trying to center the numeric original.df columns on the mean of the corresponding scaling.data columns, and divide by 2 standard deviations of scaling.data columns.

Additional information that may not be essential to a working solution:

This function will be nested in a larger function. In the larger function there is an argument called predictors, which represents the column names which need to be included in the new data frame, and are also found in the scaling data frame. This could be the vector used to iterate over for the scaling function, though this is not necessarily a requirement. (Note: This vector includes column names which reference columns that are both character and numeric, again I want the function to scale numeric columns only. The final product should include the unscaled 'estimate' column from original.df).

> predictors
[1] "color"  "weight" "height" "length"
    
>original.df
    color weight height length estimate
    1    red     10     66     40        5
    2    red     12     60     41        7
    3 yellow     12     67     48        9
    4   blue     15     55     36       10
    5 yellow     21     54     48        7
    6    red     12     54     43        5
    7    red     11     38     36        6
            
     
  >scale.data
     color weight height length estimate
    1    red     11     55     41        7
    2    red     13     67     39        9
    3 yellow     12     67     46       11
    4   blue     16      8     37        5
    5 yellow     23     10     47        9
    6    red     17     11     41       10
    7    red     16     13     37       13
                
 

    scale2sd<-function(variable){
         original.df[[variable]]<-((original.df[[variable]]) - mean(scaling.data[[variable]],na.rm=TRUE))/(2*sd(scaling.data[[variable]], na.rm=TRUE))
                            return(original.df[[variable]])
                          }
        
     new.df<-original.df %>%mutate_at((!str_detect(names(.),"estimate")&is.numeric),scale)

I need the result to be the full new scaled data frame.

Thank you so much for your time and thoughts.

CodePudding user response：

We can do the following (I'm using dplyr 1.0.7 but anything >= 1.0.0 should work):

Create a function that scales

scale_to_sd <- function(other_df, target){
      
      mean(other_df[,target], na.rm=TRUE) / 
        (2*sd(other_df[, target], na.rm=TRUE))  
    }

If you need only strict numeric columns and need to exclude some columns, we can use matches which provides more flexibility than contains e.g.

df %>% 
   mutate(across(!matches("estimate|height") & where(is.numeric),
                 ~  .x - scale_to_sd(scale_df,cur_column())))

The above will scale anything but estimate or height. One can play around with the RegEx.

    color    weight height   length estimate
1    red  8.088421     66 34.87995        5
2    red 10.088421     60 35.87995        7
3 yellow 10.088421     67 42.87995        9
4   blue 13.088421     55 30.87995       10
5 yellow 19.088421     54 42.87995        7
6    red 10.088421     54 37.87995        5
7    red  9.088421     38 30.87995        6

ORIGINAL

df %>% 
  mutate(across(contains("estimate") & where(is.numeric),
                ~  .x - scale_to_sd(scale_df,cur_column())))

Apply function across target columns

 df %>% 
       mutate(across(contains("estimate"),
                     ~  .x - scale_to_sd(scale_df,cur_column())))

Result

        color weight height length estimate
    1    red     10     66     40 3.248164
    2    red     12     60     41 5.248164
    3 yellow     12     67     48 7.248164
    4   blue     15     55     36 8.248164
    5 yellow     21     54     48 5.248164
    6    red     12     54     43 3.248164
    7    red     11     38     36 4.248164

Data used:


df <- read.table(text="color weight height length estimate
    1    red     10     66     40        5
    2    red     12     60     41        7
    3 yellow     12     67     48        9
    4   blue     15     55     36       10
    5 yellow     21     54     48        7
    6    red     12     54     43        5
    7    red     11     38     36        6", head=T)

scale_df <- read.table(text=" color weight height length estimate
    1    red     11     55     41        7
    2    red     13     67     39        9
    3 yellow     12     67     46       11
    4   blue     16      8     37        5
    5 yellow     23     10     47        9
    6    red     17     11     41       10
    7    red     16     13     37       13", head=T)

CodePudding user response：

One way with base R. Comments in the code. Thanks, Nelson, for the data 1

df <- read.table(text="color weight height length estimate
    1    red     10     66     40        5
    2    red     12     60     41        7
    3 yellow     12     67     48        9
    4   blue     15     55     36       10
    5 yellow     21     54     48        7
    6    red     12     54     43        5
    7    red     11     38     36        6", head=T)

scale_df <- read.table(text=" color weight height length estimate
    1    red     11     55     41        7
    2    red     13     67     39        9
    3 yellow     12     67     46       11
    4   blue     16      8     37        5
    5 yellow     23     10     47        9
    6    red     17     11     41       10
    7    red     16     13     37       13", head=T)

## add reference and scaling df as arguments
scale2sd <- function(ref, scale_by, variable) {
  ((ref[[variable]]) - mean(scale_by[[variable]], na.rm = TRUE)) / (2 * sd(scale_by[[variable]], na.rm = TRUE))
}
predictors <- c("color", "weight", "height", "length")
## this is to get all numeric columns that are part of your predictor variables
df_to_scale <- Filter(is.numeric, df[predictors])
## create a named vector. This is a bit awkward but it makes it easier to select
## the corresponding items in the two data frames, 
## and then replace the original columns 
num_vars <- setNames(names(df_to_scale), names(df_to_scale))                      

## this is the actual scaling job - 
## use the named vector for looping over the selected columns 
## then assign it back to the selected columns
df[num_vars] <- lapply(num_vars, function(x) scale2sd(df, scale_df, x))

df
#>    color      weight     height      length estimate
#> 1    red -0.67259271 0.58130793 -0.14222363        5
#> 2    red -0.42479540 0.47561558 -0.01777795        7
#> 3 yellow -0.42479540 0.59892332  0.85334176        9
#> 4   blue -0.05309942 0.38753862 -0.64000632       10
#> 5 yellow  0.69029252 0.36992323  0.85334176        7
#> 6    red -0.42479540 0.36992323  0.23111339        5
#> 7    red -0.54869405 0.08807696 -0.64000632        6