Create a function to get summary statistics of a data frame in R

I have the data frame df3 below.

City	Income	Cost	Age
NY	1237	2432	43
NY	6352	8632	32
Boston	6487	2846	54
NJ	6547	7353	42
Boston	7564	7252	21
NY	9363	7563	35
Boston	3262	7352	54
NY	9473	8667	76
NJ	6234	4857	31
Boston	5242	7684	39
NJ	7483	4748	47
NY	9273	6573	53
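
For anyone who wants to run the code in this thread, the table above can be rebuilt roughly as follows. This is only a sketch: the column types (character for City, numeric for the rest) are an assumption, since the post shows the values but not how df3 was created.

```r
# Rebuild the example data from the table above.
# Assumption: City is character and the other columns are numeric.
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53),
  stringsAsFactors = FALSE
)

# Quick structural check: 12 rows, 4 columns
str(df3)
```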

I need to create a function 'ST' that returns the mean and standard deviation of each variable for a given city. For example, if I call ST(NY), I should get a table like the one below.

variable	Mean	SD
Income	XX	XX
Cost	XX	XX
Age	XX	XX

XX are the values rounded to 2 decimal places. I wrote a few pieces of code, but I am struggling to combine them into one function. My code is below.

library(dplyr)
df3 %>%
   group_by(City) %>% 
   summarise_at(vars("Income","Cost","Age"), median,2)

ST <- function(c) {
  if (df3$City == s)
    dataframe (
    library(dplyr)
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), mean,2),
    library(dplyr)
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), sd,2)
  else {
    "NA"
  }
}
ST(NJ)

CodePudding user response:

No need to call library(dplyr) multiple times, and calling it in the middle of a data.frame(..) expression is not right. Even if that were syntactically valid code (it could be with {...} bracing), it is generally considered better to load packages at the beginning of the function, which keeps the code organized: ST <- function(c) { library(dplyr); ... }.
From ?summarize_at:

Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.

I'll demo the use of across.
Inside summarize, across can be given a list of (named) functions at once; I'll show that, too.
Your if (df3$City == .) is wrong for a few reasons, notably because if requires its condition to be exactly length 1, but the test returns a logical vector with one element per row of df3. In R 4.2.0 and later a longer condition is an error; earlier versions gave a warning and silently used only the first element. A better tactic is to use dplyr::filter.
Your function uses objects that were neither passed to it nor defined within it, which is bad practice. Best practice is to pass the data and arguments in the function call.
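
To illustrate the point about if, here is a minimal sketch on a small stand-in for df3 (only the first four rows and two columns of the question's data; the name is_ny is just for illustration). It shows that the comparison yields one value per row, and how that vector can be used to keep rows rather than to drive an if.

```r
library(dplyr)

# Small stand-in for df3: first four rows of the question's data
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ"),
  Income = c(1237, 6352, 6487, 6547)
)

# The comparison is vectorised: one TRUE/FALSE per row, not a single value
is_ny <- df3$City == "NY"
length(is_ny)

# if() needs a single TRUE/FALSE, so reduce the vector first when a
# yes/no answer is really what is wanted
if (any(is_ny)) message("at least one NY row")

# To keep only the matching rows, use the logical vector as a row filter ...
df3[is_ny, ]

# ... or, equivalently, let dplyr::filter do the subsetting
ny_rows <- filter(df3, City == "NY")
ny_rows
```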

ST <- function(X, city, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
#   variable     mu  sigma
#   <chr>     <dbl>  <dbl>
# 1 Income   7140.  3550. 
# 2 Cost     6773.  2576. 
# 3 Age        47.8   17.7

Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:

NA inclusion works. NA == NA returns NA (which derails a lot of conditional processing if it is not handled explicitly), whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).
It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
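
Both points can be checked quickly in base R. This is just a small sketch; the vector cities_col is a made-up stand-in for a City column that contains a missing value.

```r
# == propagates NA, %in% does not
NA == NA      # NA
NA %in% NA    # TRUE

# Hypothetical City column with one missing value
cities_col <- c("NY", NA, "Boston")

eq_result <- cities_col == "NY"      # TRUE NA FALSE
in_result <- cities_col %in% "NY"    # TRUE FALSE FALSE

# %in% also accepts several values on the right-hand side
multi_result <- cities_col %in% c("NY", "Boston")   # TRUE FALSE TRUE

eq_result
in_result
multi_result
```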

From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in

ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
#   City   variable     mu  sigma
#   <chr>  <chr>     <dbl>  <dbl>
# 1 Boston Income   5639.  1847. 
# 2 Boston Cost     6284.  2299. 
# 3 Boston Age        42     15.7
# 4 NY     Income   7140.  3550. 
# 5 NY     Cost     6773.  2576. 
# 6 NY     Age        47.8   17.7

Edit: I added the rounding.
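
One caveat on the rounding: a tibble prints roughly three significant digits by default, so a value rounded to 2 decimals can still display as 5639. in the console even though the stored number has its decimals. A minimal sketch, assuming the goal is simply to see the two decimals, is to convert the result to a plain data frame before printing. The small table res below is a hypothetical stand-in for the function's output, with made-up sigma values.

```r
library(tibble)

# Hypothetical stand-in for a result returned by ST(); the sigma
# values are invented for illustration only
res <- tibble(
  variable = c("Income", "Age"),
  mu       = c(5638.75, 42),
  sigma    = c(1234.56, 12.34)
)

# Tibble printing abbreviates the numbers it displays
print(res)

# A plain data.frame prints with base R's default of 7 significant digits
plain <- as.data.frame(res)
print(plain)
```

The stored values are identical in both objects; only the printed representation differs.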

CodePudding user response:

ST <- function(city_name) {
  library(dplyr)
  library(tidyr) # pivot_longer
  df3 %>%  # the data frame from the question
    filter(City == city_name) %>% 
    pivot_longer(cols = Income:Age, names_to = "variable") %>%  
    group_by(City, variable) %>%  
    summarise(mean = mean(value), 
              sd = sd(value), .groups = "drop")
}

ST("Boston")

# A tibble: 3 × 4
  City   variable  mean     sd
  <chr>  <chr>    <dbl>  <dbl>
1 Boston Age        42    15.7
2 Boston Cost     6284. 2299. 
3 Boston Income   5639. 1847.