Sorry, I know this question has practically been asked here already, but I still don't understand how these summary functions (fun.min, fun.max, fun) truly work. In the manual's examples, they usually appear in this form:

 ggplot(data = <DATA>) +
  stat_summary(
    mapping = aes(<MAPPINGS>),
    fun.min = min,
    fun.max = max,
    fun = median
  )

From what I understand, fun.min = min defines the summary function as the min base R function, however then what happens? How does fun.min know to find the minimum value of our mappings argument? Taking a step back, why does fun.min have to even be defined as the min base R function, since isn't it already defined as finding the minimum value in a vector and then returning a single number?

CodePudding user response：

Let's start with a simple example:

library(ggplot2)

base <- ggplot(data = mtcars, mapping = aes(cyl, mpg)) +
  geom_point()

base + stat_summary(
  fun.min = min,
  fun.max = max,
  fun = median,
  color = "red"
)

Now to your questions:

From what I understand, fun.min = min defines the summary function as the min base R function, however then what happens? How does fun.min know to find the minimum value of our mappings argument?

Under the hood, stat_summary aggregates the dataset passed via the data argument: for each value or category of the x variable (or, more generally, for each combination of x and the group aesthetic), it computes the three statistics defined by fun, fun.min and fun.max on the variable mapped to the y aesthetic. The value computed by fun is then mapped to y, the value of fun.min to ymin, and the value of fun.max to ymax.

The result of this step can be inspected with ggplot2::layer_data. For the basic example:

layer_data(last_plot(), 2) # The 2 means we want the data for the second layer. geom_point = first layer, stat_summary = second layer
#>   x group    y ymin ymax PANEL flipped_aes colour size linetype shape fill
#> 1 4    -1 26.0 21.4 33.9     1       FALSE    red  0.5        1    19   NA
#> 2 6    -1 19.7 17.8 21.4     1       FALSE    red  0.5        1    19   NA
#> 3 8    -1 15.2 10.4 19.2     1       FALSE    red  0.5        1    19   NA
#>   alpha stroke
#> 1    NA      1
#> 2    NA      1
#> 3    NA      1

Apart from the extra columns added by ggplot2, that's basically the same as

library(dplyr, warn.conflicts = FALSE)

mtcars |> 
  group_by(x = cyl) |> 
  summarise(y = median(mpg), ymin = min(mpg), ymax = max(mpg))
#> # A tibble: 3 × 4
#>       x     y  ymin  ymax
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4  26    21.4  33.9
#> 2     6  19.7  17.8  21.4
#> 3     8  15.2  10.4  19.2

Taking a step back, why does fun.min have to even be defined as the min base R function, since isn't it already defined as finding the minimum value in a vector and then returning a single number?

The answer is no. fun.min is not defined as the minimum or min function; it's NULL by default (see ?stat_summary). fun.min supplies the function used to compute the summary statistic that is mapped to the ymin aesthetic. If fun.min were fixed to the minimum, stat_summary wouldn't be that useful. Quite often, for example, we use fun.min and fun.max to display a confidence interval or, as here, the interquartile range:

base + stat_summary(
    fun.min = ~quantile(.x, probs = .25),
    fun.max = ~quantile(.x, probs = .75),
    fun = median,
    color = "red"
  )
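The confidence-interval case mentioned above can be sketched the same way. This is only an illustration: se, ci_lower and ci_upper are hypothetical helpers, not part of ggplot2, and the bounds assume a normal approximation (mean plus or minus 1.96 standard errors), which is rough for groups this small.

```r
library(ggplot2)

# Hypothetical helpers (not ggplot2 functions): each takes the vector of
# y values for one x group and returns a single number.
se <- function(x) sd(x) / sqrt(length(x))
ci_lower <- function(x) mean(x) - 1.96 * se(x)  # assumed normal approximation
ci_upper <- function(x) mean(x) + 1.96 * se(x)

p_ci <- ggplot(mtcars, aes(cyl, mpg)) +
  geom_point() +
  stat_summary(
    fun = mean,          # position of the dot
    fun.min = ci_lower,  # lower end of the interval (ymin)
    fun.max = ci_upper,  # upper end of the interval (ymax)
    color = "red"
  )

p_ci
```

Any function that maps a numeric vector to a single number can be used in place of these helpers.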

CodePudding user response：

What do `fun.min`/`fun.max`/`fun` do?

These functions manipulate different parts of the ggplot output. They take a vector of values and produce a single value that is then placed on the figure.

fun.min assigns a value to the lower bound of an interval. Usually, this is the minimum of the y values.
fun.max assigns a value to the upper bound of the interval. Usually, this is the maximum of the y values.
fun assigns a value to the dot that is often in the middle of the interval. Usually, this is the mean of the y values.

You can see this behavior by observing how the plots below vary with the functions passed to these arguments. It helps to think about these as controlling the interval - which can represent whatever you want - rather than showing the max, min, and mean. For example, instead of displaying the minimum, maximum, and mean values with the interval, you could show an interval based on the 1st, 2nd, and 3rd tertile of the values in the vector by passing functions that estimate each of these quantities. Below, I demonstrate passing different functions.

A plot with the "typical" functions: `min`, `max`, `mean`

set.seed(13)
library(ggplot2)

# An example ggplot
d <- ggplot(mtcars, aes(cyl, mpg)) + geom_point()

## Typical choice (fun.min = min); note that the actual default is NULL
d + stat_summary(fun = mean, fun.min = min, fun.max = max, colour = "red")

Pass `fun.min` a function that returns 1

d + stat_summary(fun = mean, fun.min = function(x){1}, fun.max = max, colour = "red")

Pass `fun.min` a function that returns a random value from `c(-20, 0, 20)`

d + stat_summary(fun = mean,
                 fun.min = function(x){sample(c(-20, 0, 20), size = 1)},
                 fun.max = max, colour = "red")

Pass `fun` a function that returns 1

See the dot move compared to the first plot!

d + stat_summary(fun = function(x){1}, fun.min = min, fun.max = max, colour = "red")

How does it "know" our mappings?

The documentation notes that it can be ambiguous what vector is actually being passed to these arguments. By default, ggplot tries to figure out the orientation of the figure (what variables are on the x versus y axes) and then passes the y-axis values. However, you can set the orientation manually to avoid producing unexpected results.

Here is the description of the orientation argument to stat_summary:

The orientation of the layer. The default (NA) automatically determines the orientation from the aesthetic mapping. In the rare event that this fails it can be given explicitly by setting orientation to either "x" or "y". See the Orientation section for more detail.
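One way to check which vector actually reaches the summary function is to pass a wrapper that records its input. The sketch below assumes the default orientation (mpg on the y axis); spy_mean and seen are hypothetical names introduced only for this illustration.

```r
library(ggplot2)

# Collects every vector that ggplot2 hands to the summary function.
seen <- list()

# Hypothetical wrapper: stores its input, then behaves like mean().
spy_mean <- function(y) {
  seen[[length(seen) + 1]] <<- y
  mean(y)
}

p_spy <- ggplot(mtcars, aes(cyl, mpg)) +
  stat_summary(fun = spy_mean, geom = "point")

# Building the layer data triggers the summary computation.
invisible(layer_data(p_spy, 1))

# Each element of `seen` is the set of mpg values for one cyl group.
lengths(seen)
```

Comparing the recorded vectors with split(mtcars$mpg, mtcars$cyl) shows that the function receives the y values of one x group at a time, not the whole column.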

Alternative `geom` types

While changing these arguments affects the interval in the above examples, it is also possible to change the geom type within stat_summary. Instead of layering an interval defined by 3 points on top of the plot, we could add a line. You'll see below that the fun.min argument now changes the scale of the axis but not the line's values, whereas fun still determines the line's position.

`geom = "line"`: typical functions

d + stat_summary(fun = mean, fun.min = min, fun.max = max, colour = "red", geom = "line")

`geom = "line"`: pass `fun.min` a function that returns 5

d + stat_summary(fun = mean, fun.min = function(x){5}, fun.max = max, colour = "red", geom = "line")

`geom = "line"`: pass `fun` a function that returns 5

d + stat_summary(fun = function(x){5}, fun.min = min, fun.max = max, colour = "red", geom = "line")
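To see why fun.min can move the axis without moving the line, it may help to inspect the computed layer data. A minimal sketch, assuming the same mtcars example; p_line and ld are names used only here.

```r
library(ggplot2)

p_line <- ggplot(mtcars, aes(cyl, mpg)) +
  stat_summary(
    fun = mean,
    fun.min = function(x) 5,  # constant lower bound, as in the example above
    fun.max = max,
    geom = "line"
  )

# The stat still computes ymin and ymax, even though the line geom
# only draws x and y.
ld <- layer_data(p_line, 1)
ld[, c("x", "y", "ymin", "ymax")]
```

The ymin column is still present in the layer data, so it takes part in setting the y-axis range even though the line itself is drawn from x and y alone.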