Predict in workflow throws that column doesn't exist

Given the following code

library(tidyverse)
library(lubridate)
library(tidymodels)
library(ranger)

df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/bitcoin.csv")

df <- df %>%
  mutate(Date = dmy(Date),
         Change_Percent = str_replace(Change_Percent, "%", ""),
         Change_Percent = as.double(Change_Percent)
         ) %>%
  filter(year(Date) > 2017)

int <- interval(ymd("2020-01-20"), 
                ymd("2022-01-15"))

df <- df %>%
  mutate(covid = Date %within% int)

df %>%
  ggplot(aes(x = Date, y = Price, color = covid)) +
  geom_line()

df <- df %>%
  arrange(Date) %>%
  mutate(lag1 = lag(Price),
         lag2 = lag(lag1),
         lag3 = lag(lag2),
         profit_next_day = lead(Profit))

# modeling
df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day))

set.seed(42)
data_split <- initial_split(df_mod) # 3/4
train_data <- training(data_split)
test_data  <- testing(data_split)

bitcoin_rec <- 
  recipe(profit_next_day ~ ., data = train_data) %>%
  step_naomit(all_outcomes(), all_predictors()) %>%
  step_normalize(all_numeric_predictors())

bitcoin_prep <-
  prep(bitcoin_rec)

bitcoin_train <- bake(bitcoin_prep, new_data = NULL)
bitcoin_test  <- bake(bitcoin_prep, test_data)

rf_spec <- 
  rand_forest(trees = 200) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

bitcoin_wflow <- 
  workflow() %>% 
  add_model(rf_spec) %>% 
  add_recipe(bitcoin_prep)

bitcoin_fit <-
  bitcoin_wflow %>%
  fit(data = train_data)

final_model <- last_fit(bitcoin_wflow, data_split)

collect_metrics(final_model)

final_model %>%
  extract_workflow() %>%
  predict(test_data)

The last chunk of code, which extracts the workflow and predicts on test_data, throws this error:

Error in stop_subscript():
! Can't subset columns that don't exist.
x Column profit_next_day doesn't exist.

But profit_next_day does exist in test_data; I have checked several times, so I don't understand what is happening. I have never seen this error before when working with tidymodels.

CodePudding user response：

The problem here comes from using step_naomit() on the outcome. At prediction time the workflow passes only the predictors through the recipe, so a step that selects the outcome cannot find profit_next_day. In general, steps that change rows (for example, by removing them) can be tricky when it comes time to resample or predict on new data. You can read more about this in our book, but I would suggest removing step_naomit() from your recipe altogether and changing your earlier code to:

df_mod <- df %>%
  select(-covid, -Date, -Vol_K, -Profit) %>%
  mutate(profit_next_day = as.factor(profit_next_day)) %>%
  na.omit()
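
To illustrate this suggestion end to end, here is a minimal sketch on made-up data. The toy data frame and its columns (x1, x2, outcome) are hypothetical stand-ins, not the bitcoin dataset, and the sketch assumes the ranger package is installed. Incomplete rows are dropped before splitting, the recipe has no row-removing step, and the unprepped recipe is added to the workflow.

```r
library(tidymodels)  # attaches rsample, recipes, parsnip, workflows, tune, ...

set.seed(42)

# Hypothetical toy data standing in for df_mod, with a few missing values
toy <- tibble(
  x1 = rnorm(200),
  x2 = rnorm(200),
  outcome = factor(sample(c("0", "1"), 200, replace = TRUE))
)
toy$x1[c(3, 50)] <- NA
toy$outcome[c(10, 120)] <- NA

# Drop incomplete rows up front instead of inside the recipe
toy_mod <- na.omit(toy)

toy_split <- initial_split(toy_mod)
toy_test  <- testing(toy_split)

# Recipe without step_naomit(); the workflow preps it during fitting
toy_rec <- recipe(outcome ~ ., data = training(toy_split)) %>%
  step_normalize(all_numeric_predictors())

rf_spec <- rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification")

toy_wflow <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(toy_rec)

toy_final <- last_fit(toy_wflow, toy_split)

# Predicting with the extracted workflow now works on the test set
toy_preds <- toy_final %>%
  extract_workflow() %>%
  predict(toy_test)
```

Under these assumptions, toy_preds should have one row per row of toy_test and a .pred_class column.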

CodePudding user response：

Run collect_notes(final_model) to see the reason. Are Vol and market_total stored as character columns after read_csv()?
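
One way to check that hunch is to list the columns that were read in as character. This is a minimal sketch on a made-up data frame: the toy_raw columns and values are hypothetical, and the real file may look different.

```r
library(dplyr)
library(readr)

# Hypothetical raw data: a numeric column stored as text, as read_csv() can
# produce when the values carry a suffix such as "K"
toy_raw <- tibble(
  Price = c(100.5, 101.2, 99.8),
  Vol   = c("1.2K", "3.4K", "2.0K")
)

# Names of the columns stored as character
chr_cols <- toy_raw %>%
  select(where(is.character)) %>%
  names()

# parse_number() keeps the numeric part and drops the suffix;
# rescaling by the unit (thousands here) is deliberately left out
toy_clean <- toy_raw %>%
  mutate(Vol = parse_number(Vol))
```

Any column listed in chr_cols that should be numeric needs converting (or dropping) before it goes into the recipe.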