How to verify if the total area of the bins is 1?

I use the following code to get a histogram.

plt <- ggplot(iris, aes(x = Sepal.Length)) +
    geom_histogram(aes(y = ..density..))

I'm not sure what this code gives me. I think the ..density.. aesthetic inside geom_histogram() rescales the histogram so that it represents a density, but how can I verify this? And how are NA values handled? Thanks

CodePudding user response：

How to verify that the areas of the bins sum to 1

Your plot:

library(ggplot2)
library(tidyverse)

p <- iris %>% 
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(aes(y = ..density..))

First, we can deconstruct the ggplot (so we can get at the raw data):

pp <- ggplot_build(p)

# View all data
pp$data
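
As a rough sketch of what that built data contains: each row is one bin, with its edges (xmin, xmax), its count and its density. The snippet below assumes ggplot2 3.3.0 or later, uses after_stat(density) (the newer spelling of ..density..), and uses layer_data(), which should give the same data frame as ggplot_build(p)$data[[1]]. The names p_check, bins, bin_width and manual_density are only for illustration. It checks the assumption that density is count / (total count * bin width):

```r
library(ggplot2)

# Same histogram as above; bins = 30 is the default, set here to avoid the message
p_check <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

# One row per bin
bins <- layer_data(p_check)

# Columns relevant to the area calculation
head(bins[, c("xmin", "xmax", "count", "density")])

# Assumed definition: density = count / (total count * bin width)
bin_width <- bins$xmax - bins$xmin
manual_density <- bins$count / (sum(bins$count) * bin_width)

# Should be TRUE if the assumed definition matches what ggplot2 computes
all.equal(bins$density, manual_density)
```

If that comparison holds, the bar heights are relative frequencies divided by the bin width, which is why height times width sums to 1.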

To get the area covered by the histogram:

heights_of_bars <- pp$data[[1]]$y

#  [1] 0.2148148 0.0537037 0.2148148 0.1074074 0.5907407
#  [6] 0.5370370 0.4833333 0.2148148 0.3759259 0.3759259
# [11] 0.3222222 0.4296296 0.3759259 0.4833333 0.3222222
# [16] 0.2148148 0.4833333 0.6444444 0.1074074 0.4296296
# [21] 0.1611111 0.2685185 0.0537037 0.1611111 0.0537037
# [26] 0.0537037 0.0537037 0.2148148 0.0000000 0.0537037

number_of_bins <- nrow(pp$data[[1]])
# [1] 30 

x_distance <- last(pp$data[[1]]$xmax) - first(pp$data[[1]]$xmin)
# [1] 3.724138

width_of_bin <- x_distance * (1/number_of_bins)
# [1] 0.1241379

And now that we know this, we can calculate the total area:

sum(heights_of_bars * width_of_bin)
# [1] 1
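
On the NA part of the question: as far as I know, ggplot2 drops missing x values before binning (with a warning, unless na.rm = TRUE is set), so the density is computed from the remaining observations only. Here is a minimal sketch to check this, again assuming ggplot2 3.3.0 or later. iris_na is a made-up copy of iris with three values blanked out, purely for illustration. It also multiplies each bar by its own width, so it does not rely on all bins being equally wide:

```r
library(ggplot2)

# Hypothetical copy of iris with a few missing values (illustration only)
iris_na <- iris
iris_na$Sepal.Length[c(1, 50, 100)] <- NA

# na.rm = TRUE drops the missing values silently instead of with a warning
p_na <- ggplot(iris_na, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, na.rm = TRUE)

bins_na <- layer_data(p_na)

# Number of observations that were actually binned (150 rows minus 3 NAs)
sum(bins_na$count)

# Total area: each bar's height times that bar's own width
total_area <- sum(bins_na$density * (bins_na$xmax - bins_na$xmin))
all.equal(total_area, 1)
```

So the NAs do not leave a "hole" in the area: the histogram is normalised over the non-missing values.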