Referring to the input data of ggplot and using it in a custom function within a geom

Time:06-28

I'm using ggplot geom_vline in combination with a custom function to plot certain values on top of a histogram.

The example function below returns a vector of three values: the mean, and the values a given number of standard deviations (the multiplier) below and above the mean. I can pass these values to geom_vline(xintercept = ...) and see them in my graph.

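#example function
library(tidyverse)  # provides tibble, pivot_longer, pull, %>% and ggplot2

sds_around_the_mean <- function(x, multiplier = 1) {
  mean <- mean(x, na.rm = TRUE)
  sd <- sd(x, na.rm = TRUE)
  
  # returns the three values (low, mean, high) as an unnamed numeric vector
  tibble(low   = mean - multiplier * sd,
         mean  = mean,
         high  = mean + multiplier * sd) %>% 
    pivot_longer(cols = everything()) %>% 
    pull(value)
}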
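As a quick check of what the function returns, here is a minimal base-R sketch that should give the same three values without the tibble round trip. The name sds_around_the_mean_base and the small input vector are made up for illustration and are not part of the original question.

```r
# Hypothetical base-R equivalent of sds_around_the_mean (illustration only)
sds_around_the_mean_base <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # low, mean, high as an unnamed numeric vector
  c(m - multiplier * s, m, m + multiplier * s)
}

# For 1:5 the mean is 3 and the sample sd is sqrt(2.5),
# so the result is 3 - sqrt(2.5), 3, 3 + sqrt(2.5)
sds_around_the_mean_base(c(1, 2, 3, 4, 5))
```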

Reproducible data

    #data
set.seed(123)
normal <- tibble(data = rnorm(1000, mean = 100, sd = 5))
outliers <- tibble(data = runif(5, min = 150, max = 200))

df <- bind_rows(lst(normal, outliers), .id = "type")

df %>% 
  ggplot(aes(x = data)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 3),
             linetype = "dashed", color = "red") + 
  geom_vline(xintercept = sds_around_the_mean(df$data, multiplier = 2),
             linetype = "dashed")

[Figure: example_hist]

The problem is that, as you can see, I have to write df$data in several places. This becomes error-prone as soon as I change the original df that I pipe into ggplot, e.g. by filtering out outliers before plotting: I have to repeat the same change in every one of those places.

E.g.
df %>% filter(type == "normal")
#also requires 
df$data 
#to be changed to 
df$data[df$type == "normal"] 
#in geom_vline to obtain the correct input values for the xintercept.

So instead, how could I replace the df$data argument with the respective column of whatever has been piped into ggplot() in the first place? Something similar to the "." operator, I assume. I've also tried stat_summary with geom = "vline" to achieve this, but without the desired effect.

User response:

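You can enclose the ggplot part in curly brackets and reference the incoming dataset with the . symbol, both in the ggplot call and when calculating sds_around_the_mean. Inside the braces, . is whatever was piped in, so any filtering applied upstream is reflected in both the histogram and the vertical lines. Note that this relies on the magrittr pipe %>%; the native |> pipe has no . placeholder.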

df %>% 
  {ggplot(data = ., aes(x = data)) + 
  geom_histogram(bins = 100) + 
  geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 3),
             linetype = "dashed", color = "red") + 
  geom_vline(xintercept = sds_around_the_mean(.$data, multiplier = 2),
             linetype = "dashed")}
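A possible alternative that avoids the braces: a ggplot2 layer's data argument can be a function, which ggplot2 calls with the plot data and which must return a data frame. The sketch below uses that to compute the intercepts from whatever was piped into ggplot(). The helper vline_data is a made-up name for illustration, and sds_around_the_mean is rewritten in base R here only to keep the block self-contained; it should return the same values as the function in the question.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# Base-R stand-in for the question's function (same three values)
sds_around_the_mean <- function(x, multiplier = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(m - multiplier * s, m, m + multiplier * s)
}

# Hypothetical helper: returns a function that ggplot2 will call with the
# plot data (d) and that yields one row per vertical line
vline_data <- function(multiplier) {
  function(d) tibble(xintercept = sds_around_the_mean(d$data, multiplier))
}

set.seed(123)
df <- bind_rows(
  tibble(type = "normal",   data = rnorm(1000, mean = 100, sd = 5)),
  tibble(type = "outliers", data = runif(5, min = 150, max = 200))
)

p <- df %>% 
  filter(type == "normal") %>%   # the filter is applied in one place only
  ggplot(aes(x = data)) + 
  geom_histogram(bins = 100) + 
  geom_vline(data = vline_data(3), aes(xintercept = xintercept),
             linetype = "dashed", color = "red") + 
  geom_vline(data = vline_data(2), aes(xintercept = xintercept),
             linetype = "dashed")
p
```

Because the intercepts are computed from the plot data itself, this variant does not depend on the . placeholder, so it should also work with the native |> pipe.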