I am trying to split a dataset from tidymodels in R.
library(tidymodels)
data(Sacramento, package = "modeldata")
data_split <- initial_split(Sacramento, prop = 0.75, strata = price)
Sac_train <- training(data_split)
I want to describe the distribution of the training dataset, but the following error occurs.
Sac_train %>%
  select(price) %>%
  summarize(min_sell_price = min(),
            max_sell_price = max(),
            mean_sell_price = mean(),
            sd_sell_price = sd())
# Error: In min() : no non-missing arguments to min; returning Inf
However, the following code works.
Sac_train %>%
  summarize(min_sell_price = min(price),
            max_sell_price = max(price),
            mean_sell_price = mean(price),
            sd_sell_price = sd(price))
My question is: why does select(price) not work in the first example? Thanks.
CodePudding user response:
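Assuming your data are a data frame, select(price) still returns a data frame, even though it has only one column, so you still need to tell dplyr which column to summarize. In your first example, min(), max(), mean() and sd() are called with no arguments: the pipe passes the data frame to summarize(), not the price column to each function.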
In other words, it doesn't treat a single-column data frame as a vector that you can pass through a function - i.e.:
Sac_train.vec <- 1:25
mean(Sac_train.vec)
# [1] 13
will calculate the mean, whereas
Sac_train.df <- data.frame(price = 1:25)
mean(Sac_train.df)
does not: it returns NA with a warning that the argument is not numeric or logical.
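If you do want a plain vector that you can hand straight to a function like mean(), one option is dplyr's pull(), which extracts a single column instead of keeping the data frame structure. A minimal sketch, assuming a small made-up data frame (sac_df is a hypothetical stand-in, not the real Sacramento training set):

```r
library(dplyr)

# Hypothetical stand-in for the training data (assumption, not the real data)
sac_df <- data.frame(price = 1:25)

# select() keeps the data frame structure, even with a single column
still_df <- is.data.frame(select(sac_df, price))   # TRUE

# pull() extracts the column as a plain vector
price_vec <- pull(sac_df, price)

# A vector can be passed directly to mean()
mean_price <- mean(price_vec)                      # 13
mean_price
```

Inside summarize() this is not needed, because there you refer to the column by name, as in your second example.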
In the special case of only one column, this may be more parsimonious code:
library(dplyr)  # provides %>%, select(), summarize(), across()
# Example data
Sac_train <- data.frame(price = 1:25, col2 = LETTERS[1:25])
Sac_train %>%
  select(price) %>%
  summarize(across(everything(),
                   list(min = min, max = max, mean = mean, sd = sd)))
Output:
#   price_min price_max price_mean price_sd
# 1         1        25         13 7.359801