output step_lencode_mixed (from R package embed)

I have three questions about the sample code below, which illustrates the use of step_lencode_mixed.

I read in the vignette: "For each factor predictor, a generalized linear model is fit to the outcome and the coefficients are returned as the encoding."

In the output from the example below, the column 'partial' holds the values returned by step_lencode_mixed. My questions:

  1. Should I use this partial as the encoded categorical variable "where_town" in the new model to be fitted?
  2. Is there a complete model (Class ~ ., data = okc_train) with all variables fitted on Class in the background, with the contribution from the variable "where_town" returned as partial?
  3. If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason, I suppose the outcome is not a coefficient?

Thanks a lot!

# ------------------------------------------------------------------------------
# Feature Engineering and Selection: A Practical Approach for Predictive Models
# by Max Kuhn and Kjell Johnson
#
# ------------------------------------------------------------------------------
# 
# Code for Section 5.4 at
# https://bookdown.org/max/FES/categorical-supervised-encoding.html
#
# ------------------------------------------------------------------------------
# 
# Code requires these packages: 

library(tidymodels)
library(embed)

# Create example data ----------------------------------------------------------

load("../Data_Sets/OkCupid/okc.RData")
load("../Data_Sets/OkCupid/okc_binary.RData")

options(width = 120)

partial_rec <- 
  recipe(Class ~ ., data = okc_train) %>%
  step_lencode_mixed(
    where_town,
    outcome = vars(Class)
  ) %>%
  prep()


okc_props <- 
  okc_train %>%
  group_by(where_town) %>%
  summarise(
    rate = mean(Class == "stem"),
    raw  = log(rate / (1 - rate)),
    n    = length(Class)
  ) %>%
  mutate(where_town = as.character(where_town))

okc_props


# Organize results -------------------------------------------------------------

partial_pooled <- 
  tidy(partial_rec, number = 1) %>%
  dplyr::select(-terms, -id) %>%
  setNames(c("where_town", "partial"))

partial_pooled <- left_join(partial_pooled, okc_props)

# Convert log-odds to a probability (inverse logit)
logit2prob <- function(logit) {
  odds <- exp(logit)
  prob <- odds / (1 + odds)
  return(prob)
}

partial_pooled$prob_partial <- logit2prob(partial_pooled$partial)

head(partial_pooled)

Output:

# A tibble: 6 × 6
  where_town        partial   rate   raw     n prob_partial
  <chr>               <dbl>  <dbl> <dbl> <int>        <dbl>
1 alameda             -1.68 0.157  -1.68   616        0.157
2 albany              -1.48 0.192  -1.44   146        0.185
3 belmont             -1.25 0.234  -1.19   167        0.222
4 belvedere_tiburon   -2.02 0.0857 -2.37    35        0.117
5 benicia             -2.03 0.107  -2.13   122        0.116
6 berkeley            -1.64 0.163  -1.64  2676        0.163

Answer:

Should I use this partial as the encoded categorical variable "where_town" in the new model to be fitted?

Yes. You don't have to do it manually, though: the bake() function does that for you automatically (the same as if you include the recipe in a workflow).
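
For example, a minimal sketch using partial_rec and okc_train from the question:

encoded_train <- bake(partial_rec, new_data = okc_train)

# where_town is now a numeric column holding the encoding; fit the
# downstream model on encoded_train, or add the recipe to a workflow()
# and let it handle the baking for you.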

Is there a complete model (Class ~ ., data = okc_train) with all variables on Class fitted in the background and is the contribution from variable "where_town" returned as partial?

Yes. There is more information in the tidymodels book (section 17.3).

If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason, I suppose the outcome is not a coefficient?

A simpler method to do the conversion to the rate is binomial()$linkinv(partial_pooled$partial).
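
For instance, with partial_pooled and logit2prob from the question's code:

# binomial()$linkinv is the inverse logit (plogis), so it agrees
# with the hand-rolled logit2prob() above:
all.equal(
  binomial()$linkinv(partial_pooled$partial),
  logit2prob(partial_pooled$partial)
)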

The value given in the partial column is a log-odds value (hence the negative numbers); it is estimated with a logistic regression mixed model. That model uses an empirical Bayes estimation approach which shrinks the per-category coefficient estimates toward the overall (population) estimate.
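
Roughly speaking, the underlying fit is a generalized linear mixed model with a random intercept per town; a sketch along these lines (the exact call inside step_lencode_mixed may differ):

library(lme4)

# One population-level intercept plus a shrunken random intercept
# for each value of where_town:
mixed_fit <- glmer(
  Class ~ 1 + (1 | where_town),
  data = okc_train,
  family = binomial
)

# The per-town encoding is the fixed intercept plus that town's
# random effect:
fixef(mixed_fit)[["(Intercept)"]] + ranef(mixed_fit)$where_town[["(Intercept)"]]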

The amount of shrinkage for this model is based on a few things, but it is mostly driven by the per-category sample size: categories with smaller sample sizes are shrunk more than categories with larger amounts of data. So the raw and shrunken estimates for berkeley are about the same, since there were 2676 data points there, while belvedere_tiburon shows a larger difference between the two estimates because its sample size was only 35.
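
You can see this in the table built above; assuming partial_pooled from the question, the towns with the biggest gap between the raw and shrunken estimates should be the small-n ones:

partial_pooled %>%
  mutate(shrinkage = abs(partial - raw)) %>%
  arrange(desc(shrinkage)) %>%
  dplyr::select(where_town, n, raw, partial, shrinkage) %>%
  head()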
