How to use %>% in tidymodels in R?

I am trying to split a dataset from tidymodels in R.

library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)

I want to describe the distribution of the training dataset, but the following error occurs.

Sac_train %>%
  select(price) %>%
  summarize(min_sell_price = min(),
            max_sell_price = max(),
            mean_sell_price = mean(),
            sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf

However, the following code works.

Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))

My question is: why does select(price) not work in the first example? Thanks.

CodePudding user response:

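Assuming your data are a data frame, the result of select(price) is still a data frame even though it has only one column, so you still need to tell summarize() which column each function should operate on. Calling min() with no arguments gives it nothing to work with.

In other words, R doesn't treat a single-column data frame as a vector that you can pass directly to a function. For example: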

Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13

will calculate the mean, whereas

Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)

does not: mean() returns NA with a warning, because a data frame is not a numeric vector.
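
To get a mean out of a one-column data frame in base R, the column has to be extracted as a vector first. A minimal sketch, reusing the toy Sac_train.df from above (price_vec is just an illustrative name):

```r
# Toy one-column data frame, same as above
Sac_train.df <- data.frame(price = 1:25)

# `$` and `[[` return the column as a plain vector
price_vec <- Sac_train.df$price
is.vector(price_vec)                  # TRUE
mean(price_vec)                       # 13
mean(Sac_train.df[["price"]])         # 13

# Single-bracket indexing keeps the data frame wrapper
is.data.frame(Sac_train.df["price"])  # TRUE
```

This is the same distinction that matters inside the pipe: select() behaves like single-bracket indexing and hands a data frame to the next step.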

If you do want to keep select(), across(everything(), ...) applies each function to every remaining column, which is more concise when only one column is left:

# Example Data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])

Sac_train %>% 
  select(price) %>%
  summarize(across(everything(), 
                   list(min = min, max = max, mean = mean, sd = sd)))

Output:

#   price_min price_max price_mean price_sd
# 1         1        25         13 7.359801
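
If the select() step is only there to pick the column, two other options may be worth considering: name the column inside across(), or use pull() to turn the column into a vector before calling the summary functions. A sketch on the same toy data, assuming dplyr 1.0.0 or later so that across() is available (res and price_vec are illustrative names):

```r
library(dplyr)

# Same example data as above
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])

# Option 1: skip select() and name the column inside across()
res <- Sac_train %>%
  summarize(across(price, list(min = min, max = max, mean = mean, sd = sd)))
names(res)
# "price_min" "price_max" "price_mean" "price_sd"

# Option 2: pull() returns the column as a vector,
# so base functions can be called on it directly
price_vec <- Sac_train %>% pull(price)
mean(price_vec)   # 13
```

Option 1 keeps everything in one summarize() call and returns a data frame; option 2 is handier when a single number is all that is needed.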