Create a function to get summary statistics of a data frame in R

Time:12-06

I have the data frame df3 below.

City Income Cost Age
NY 1237 2432 43
NY 6352 8632 32
Boston 6487 2846 54
NJ 6547 7353 42
Boston 7564 7252 21
NY 9363 7563 35
Boston 3262 7352 54
NY 9473 8667 76
NJ 6234 4857 31
Boston 5242 7684 39
NJ 7483 4748 47
NY 9273 6573 53
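
For anyone who wants to reproduce the problem, the table above can be entered as a data frame roughly as follows. This is only a sketch: the values are copied from the table, but the column types (City as character, the rest numeric) are an assumption, since the question does not show how df3 was created.

```r
# Rebuild df3 from the table in the question (values copied verbatim).
# Assumption: City is a character column, the other columns are numeric.
df3 <- data.frame(
  City   = c("NY", "NY", "Boston", "NJ", "Boston", "NY",
             "Boston", "NY", "NJ", "Boston", "NJ", "NY"),
  Income = c(1237, 6352, 6487, 6547, 7564, 9363,
             3262, 9473, 6234, 5242, 7483, 9273),
  Cost   = c(2432, 8632, 2846, 7353, 7252, 7563,
             7352, 8667, 4857, 7684, 4748, 6573),
  Age    = c(43, 32, 54, 42, 21, 35, 54, 76, 31, 39, 47, 53),
  stringsAsFactors = FALSE
)

# Inspect the structure: 12 rows, 4 columns.
str(df3)
```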

I need to create a function 'ST' that returns the mean and standard deviation for a given city. For example, if I call ST(NY), I should get a table like the one below.

variable Mean SD
Income XX XX
Cost XX XX
Age XX XX

XX are the values rounded to 2 decimal places. I wrote some code, but I am struggling to combine the pieces into one function. My code is below.

library(dplyr)
df3 %>%
   group_by(City) %>% 
   summarise_at(vars("Income","Cost","Age"), median,2)

ST <- function(c) {
  if (df3$City == s)
    dataframe (
    library(dplyr)
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), mean,2),
    library(dplyr)
    df3 %>%
       group_by(City) %>% 
       summarise_at(vars("Income","Cost","Age"), sd,2)
  else {
    "NA"
  }
}
ST(NJ)

CodePudding user response:

  1. There is no need to call library(dplyr) multiple times, and doing so in the middle of a data.frame(..) expression is not right. Even if that were syntactically correct code (it could be with {...} bracing), it is generally considered better to keep such calls in one place at the top, which keeps the code organized. Put it at the beginning of your function, ST <- function(c) { library(dplyr); ... }.

  2. From ?summarize_at,

    Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb. See vignette("colwise") for details.

    I'll demo the use of across.

  3. summarize can be given multiple (named) functions at once; I'll show that, too.

  4. Your if (df3$City == .) is wrong for a few reasons, notably because if requires its conditional to be exactly length-1 (anything else is an error, a warning, and/or logical failure) but the test is returning a logical vector as long as the number of rows in df3. A better tactic is to use dplyr::filter.

  5. Your function uses objects that were neither passed to it nor defined within it; this is bad practice. Best practice is to pass the data and arguments in the function call.
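
Point 4 can be checked in base R. The following is a minimal sketch using a small made-up data frame, toy (hypothetical, not the asker's data): the comparison yields one logical value per row, which is why it works for subsetting rows but not as an if() condition.

```r
# Hypothetical toy data, used only to illustrate the length problem.
toy <- data.frame(City = c("NY", "NJ", "NY"), Income = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# The comparison is vectorised: one TRUE/FALSE per row.
cond <- toy$City == "NY"
length(cond)   # 3, not 1, so it is not a valid if() condition

# Use the logical vector to subset rows instead of passing it to if().
ny_rows <- toy[cond, ]
mean(ny_rows$Income)   # 20
```

The function below applies the same idea with dplyr::filter.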

ST <- function(X, city, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% city) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(everything(), names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value"))
}
ST(df3, "NY")
# # A tibble: 3 x 3
#   variable     mu  sigma
#   <chr>     <dbl>  <dbl>
# 1 Income   7140.  3550. 
# 2 Cost     6773.  2576. 
# 3 Age        47.8   17.7

Notice that I used City %in% city instead of ==; in most cases this is identical, but there are two benefits to this:

  1. NA inclusion works. Note that NA == NA returns NA (which derails a lot of conditional processing if not handled correctly), whereas NA %in% NA returns TRUE, which seems more intuitive (to me at least).

  2. It allows for city (the function argument) to be length other than 1, such as ST(df3, c("NY", "Boston")). While that may not be a necessary thing for this function, it can be a handy utility in other function definitions, and can be a good thing to consider. Said differently and in CS-speak, it's good to think about a function handling not just "1" or "2" static things, but perhaps "1 or more" or "0 or more" (relatively unlimited number of arguments). (For this, I'll rename the function argument from city to cities, suggesting it can take more than one.)
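
Both points are easy to verify at the console. A short base R sketch follows; the vector city_col is made up for illustration and is not part of the original data.

```r
# Benefit 1: NA handling differs between == and %in%.
eq_na <- NA == NA      # NA
in_na <- NA %in% NA    # TRUE

# Hypothetical column with a missing value.
city_col <- c("NY", "NJ", NA)

eq_one <- city_col == "NY"             # TRUE FALSE NA
in_one <- city_col %in% "NY"           # TRUE FALSE FALSE

# Benefit 2: the right-hand side may hold more than one city.
in_two <- city_col %in% c("NY", "NJ")  # TRUE TRUE FALSE

print(list(eq_na, in_na, eq_one, in_one, in_two))
```

Because %in% never returns NA, the result can be used for filtering without an extra is.na() check.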

From this use of %in%, it might make sense to include the city name in the output; this can be done by adding a group_by after the filter, as in

ST <- function(X, cities, digits = 2, na.rm = TRUE) {
  library(dplyr)
  library(tidyr) # pivot_longer
  filter(X, City %in% cities) %>%
    group_by(City) %>%
    summarize(across(c("Income", "Cost", "Age"), 
                     list(mu = ~ mean(., na.rm = na.rm),
                          sigma = ~ sd(., na.rm = na.rm)))) %>%
    pivot_longer(-City, names_pattern = "(.*)_(.*)",
                 names_to = c("variable", ".value")) %>%
    mutate(across(c(mu, sigma), ~ round(., digits)))
}
ST(df3, c("NY", "Boston"))
# # A tibble: 6 x 4
#   City   variable     mu  sigma
#   <chr>  <chr>     <dbl>  <dbl>
# 1 Boston Income   5639.  1847. 
# 2 Boston Cost     6284.  2299. 
# 3 Boston Age        42     15.7
# 4 NY     Income   7140.  3550. 
# 5 NY     Cost     6773.  2576. 
# 6 NY     Age        47.8   17.7

Edit: I added the rounding.
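
The rounding is not visible in the printed tibble above because tibbles display roughly three significant digits by default (as far as I know this is governed by the pillar.sigfig option); the stored values are still rounded. A minimal base R sketch of getting exactly two decimals on screen, using only the NY Income values from the question (the names ny_income and stats are made up for this example):

```r
# NY rows of the Income column, copied from the question's table.
ny_income <- c(1237, 6352, 9363, 9473, 9273)

# Round the statistics to 2 decimal places, as the question asks.
stats <- data.frame(
  variable = "Income",
  Mean     = round(mean(ny_income), 2),
  SD       = round(sd(ny_income), 2),
  stringsAsFactors = FALSE
)

# A plain data.frame prints more digits than a tibble does.
print(stats)

# For fixed two-decimal text (e.g. in a report), format explicitly.
mean_txt <- sprintf("%.2f", stats$Mean)   # "7139.60"
sd_txt   <- sprintf("%.2f", stats$SD)
```

Calling as.data.frame() on the result of ST() before printing should have a similar effect.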

CodePudding user response:

library(dplyr)
library(tidyr) # pivot_longer

ST <- function(city_name) {
  df3 %>%  # the question's data frame is named df3
    filter(City == city_name) %>% 
    pivot_longer(cols = Income:Age, names_to = "variable") %>%  
    group_by(City, variable) %>%  
    summarise(mean = mean(value), 
              sd = sd(value), .groups = "drop")
}

ST("Boston")

# A tibble: 3 × 4
  City   variable  mean     sd
  <chr>  <chr>    <dbl>  <dbl>
1 Boston Age        42    15.7
2 Boston Cost     6284. 2299. 
3 Boston Income   5639. 1847. 