I have the data frame df3 below.
City | Income | Cost | Age |
---|---|---|---|
NY | 1237 | 2432 | 43 |
NY | 6352 | 8632 | 32 |
Boston | 6487 | 2846 | 54 |
NJ | 6547 | 7353 | 42 |
Boston | 7564 | 7252 | 21 |
NY | 9363 | 7563 | 35 |
Boston | 3262 | 7352 | 54 |
NY | 9473 | 8667 | 76 |
NJ | 6234 | 4857 | 31 |
Boston | 5242 | 7684 | 39 |
NJ | 7483 | 4748 | 47 |
NY | 9273 | 6573 | 53 |
I need to create a function 'ST' that returns the mean and standard deviation for a given city. For example, if I call ST(NY), I should get a table like the one below.
variable | Mean | SD |
---|---|---|
Income | XX | XX |
Cost | XX | XX |
Age | XX | XX |
XX are the values to 2 decimal places. I wrote some code, but I am struggling to combine the pieces into one function. My code is below.
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), median,2)
ST <- function(c) {
if (df3$City == s)
dataframe (
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), mean,2),
library(dplyr)
df3 %>%
group_by(City) %>%
summarise_at(vars("Income","Cost","Age"), sd,2)
else {
"NA"
}
}
ST(NJ)
CodePudding user response:
No need to call `library(dplyr)` multiple times, and doing so in the middle of a `data.frame(..)` expression is not right. Candidly, even if that were syntactically correct code (it could be with `{...}` bracing), it is generally considered better to load packages once at the beginning of the function, which keeps the code organized: `ST <- function(c) { library(dplyr); ... }`.
From `?summarize_at`: "Scoped verbs (`_if`, `_at`, `_all`) have been superseded by the use of `across()` in an existing verb. See `vignette("colwise")` for details." I'll demo the use of `across`. `summarize` can be given multiple (named) functions at once; I'll show that, too.
Your `if (df3$City == .)` is wrong for a few reasons, notably because `if` requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure), but the test returns a `logical` vector as long as the number of rows in `df3`. A better tactic is to use `dplyr::filter`.
Your function uses objects that were neither passed to it nor defined within it, which is bad practice. Best practice is to pass the data and arguments in the function call.
ST <- function(X, city, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
# variable mu sigma
# <chr> <dbl> <dbl>
# 1 Income 7140. 3550.
# 2 Cost 6773. 2576.
# 3 Age 47.8 17.7
Notice that I used `City %in% city` instead of `==`; in most cases this is identical, but there are two benefits to this:
`NA` inclusion works. Note that `NA == NA` returns `NA` (which stifles much conditional processing if not captured correctly) whereas `NA %in% NA` returns `TRUE`, which seems more intuitive (to me at least).
It allows for `city` (the function argument) to be length other than 1, such as `ST(df3, c("NY", "Boston"))`. While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from `city` to `cities`, suggesting it can take more than one.)
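To make the difference between `==` and `%in%` concrete, here is a small base-R sketch. The `city` vector and the result names are made up for illustration and are not part of the question's data.

```r
# Comparing == and %in% when NA is involved (base R only).
eq_result <- NA == NA      # comparison with NA propagates NA
in_result <- NA %in% NA    # %in% is based on match(), so NA matches NA

# Hypothetical vector with a missing entry, for illustration only.
city <- c("NY", NA, "Boston")

eq_vec <- city == "NY"     # the NA element stays NA
in_vec <- city %in% "NY"   # the NA element becomes FALSE

# %in% also accepts several candidate values on the right-hand side.
in_multi <- city %in% c("NY", "Boston")

print(list(eq_result, in_result, eq_vec, in_vec, in_multi))
```

Inside `filter`, rows where the condition is `NA` are dropped either way; the difference matters mostly when the value being searched for is itself `NA`, or when the result feeds into `if`.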
From this use of `%in%`, it might make sense to include the city name in the output; this can be done by adding a `group_by` after the `filter`, as in
ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"),
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
# City variable mu sigma
# <chr> <chr> <dbl> <dbl>
# 1 Boston Income 5639. 1847.
# 2 Boston Cost 6284. 2299.
# 3 Boston Age 42 15.7
# 4 NY Income 7140. 3550.
# 5 NY Cost 6773. 2576.
# 6 NY Age 47.8 17.7
Edit: I added the rounding.
CodePudding user response:
library(dplyr)
library(tidyr) # pivot_longer
ST <- function(city_name) {
  df3 %>%  # the question's data frame is df3
    filter(City == city_name) %>%
    pivot_longer(cols = Income:Age, names_to = "variable") %>%
    group_by(City, variable) %>%
    summarise(mean = mean(value),
              sd = sd(value), .groups = "drop")
}
ST("Boston")
# A tibble: 3 × 4
City variable mean sd
<chr> <chr> <dbl> <dbl>
1 Boston Age 42 15.7
2 Boston Cost 6284. 2299.
3 Boston Income 5639. 1847.
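For anyone who wants to reproduce the outputs above, the sketch below rebuilds `df3` from the table in the question and produces the requested `variable | Mean | SD` layout rounded to 2 decimals. It uses base R only, as an alternative to the dplyr answers rather than a replacement for them; the function name `ST_base` and its `vars` and `digits` arguments are hypothetical names chosen for this illustration.

```r
# Sample data copied from the table in the question.
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53)
)

# Hypothetical helper: mean and SD per variable for the given city (or cities).
ST_base <- function(X, city, vars = c("Income", "Cost", "Age"), digits = 2) {
  # Keep only the rows for the requested city and the columns to summarise.
  rows <- X[X$City %in% city, vars, drop = FALSE]
  data.frame(
    variable = vars,
    Mean = round(unname(vapply(rows, mean, numeric(1), na.rm = TRUE)), digits),
    SD   = round(unname(vapply(rows, sd,   numeric(1), na.rm = TRUE)), digits)
  )
}

res <- ST_base(df3, "NY")
print(res)
```

Passing the data frame as an argument, rather than reading `df3` from the global environment, follows the advice in the first answer.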