I use the following code to get a histogram:
plt <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = ..density..))
I'm not sure what I get with this code. I think the ..density.. option inside geom_histogram() is meant to scale the histogram to a density distribution, but how can I verify this? And how are NA values handled? Thanks
CodePudding user response:
How to verify that the areas of the bars sum to 1
Your plot:
library(ggplot2)
library(tidyverse)
p <- iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(aes(y = ..density..))
First, we can deconstruct the ggplot (so we can get at the raw data):
pp <- ggplot_build(p)
# View all data
pp$data
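As a rough sketch of what to look for in that output: the built data holds one row per bar, and the plotted height y should match the computed density column. This example assumes ggplot2 >= 3.3.0, where after_stat(density) is the newer spelling of ..density.., and it sets bins = 30 explicitly to match the default. It uses layer_data(), which is a shortcut for ggplot_build(p)$data[[1]].

```r
library(ggplot2)

# Rebuild the same histogram; after_stat(density) replaces ..density..
# (assumption: ggplot2 >= 3.3.0)
p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

# Data for the first (and only) layer, one row per bar
bars <- layer_data(p, 1)

# Bar edges, raw counts, computed density and the plotted height
head(bars[, c("xmin", "xmax", "count", "density", "y")])

# The plotted height is the density, not the count
all.equal(bars$y, bars$density)
```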
To get the area covered by the histogram:
heights_of_bars <- pp$data[[1]]$y
# [1] 0.2148148 0.0537037 0.2148148 0.1074074 0.5907407
# [6] 0.5370370 0.4833333 0.2148148 0.3759259 0.3759259
# [11] 0.3222222 0.4296296 0.3759259 0.4833333 0.3222222
# [16] 0.2148148 0.4833333 0.6444444 0.1074074 0.4296296
# [21] 0.1611111 0.2685185 0.0537037 0.1611111 0.0537037
# [26] 0.0537037 0.0537037 0.2148148 0.0000000 0.0537037
number_of_bins <- nrow(pp$data[[1]])
# [1] 30
x_distance <- last(pp$data[[1]]$xmax) - first(pp$data[[1]]$xmin)
# [1] 3.724138
width_of_bin <- x_distance * (1/number_of_bins)
# [1] 0.1241379
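The width can also be read off each bar directly instead of being derived from the overall range. The sketch below is only a cross-check under the same assumptions as the plot above (default equal-width binning, ggplot2 >= 3.3.0 for after_stat()); the name bar_widths is introduced here for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)
bars <- layer_data(p, 1)

# Width of every individual bar, taken from its left and right edges
bar_widths <- bars$xmax - bars$xmin

# With equal-width bins, the smallest and largest width should agree
range(bar_widths)
```

Reading the widths per bar has the advantage that it still works if the bins are not equally wide, for example when custom breaks are supplied.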
And now that we know this, we can calculate the total area:
sum(heights_of_bars * width_of_bin)
# [1] 1
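As for the NA question: iris has no missing values, so the plot above does not show it, but the binning step drops NA values (with a warning) before counting. A minimal sketch, assuming ggplot2 >= 3.3.0 and using a hypothetical copy of iris (iris_na) with three values set to NA purely for illustration:

```r
library(ggplot2)

# Hypothetical copy of iris with a few missing values injected
iris_na <- iris
iris_na$Sepal.Length[c(1, 50, 100)] <- NA

p_na <- ggplot(iris_na, aes(x = Sepal.Length)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30)

# Building the layer removes the NA rows; ggplot2 emits a warning about them
bars_na <- layer_data(p_na, 1)

# Only the non-missing observations are counted
sum(bars_na$count)

# The density is normalised over the remaining values, so the area is still 1
sum(bars_na$density * (bars_na$xmax - bars_na$xmin))
```

So missing values do not shrink the total area; they are simply left out of both the counts and the normalisation.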