Binned Histogram with overlay of empirical and/or normal distribution

Time:12-15

I am trying to look at the frequency distribution of a certain variable. Due to the large amount of data, I have created bins for a range of values and I'm plotting the count of each bin. I want to be able to overlay lines which will represent both the empirical distribution seen by my data, and what a theoretically normal distribution would look like. I can accomplish this without pre-binning my data or using ggplot2 by doing something such as this:

df <- ggplot2::diamonds
hist(df$price,freq = FALSE)
lines(density(df$price),lwd=3,col="blue")

or with ggplot2 as such:

library(ggplot2)

mean_price <- mean(df$price)
sd_price <- sd(df$price)

ggplot(df, aes(x = price)) +
  geom_histogram(aes(y = ..density..),
                 bins = 40, colour = "black", fill = "white") +
  geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
  stat_function(fun = dnorm, aes(color = 'Normal'),
                args = list(mean = mean_price, sd = sd_price)) +
  scale_colour_manual(name = "Colors", values = c("red", "blue"))

but I cannot figure out how to overlay similar lines on my pre-binned data:

breaks <- seq(from=min(df$price),to=max(df$price),length.out=11)
price_freq <- cut(df$price,breaks = breaks,right = TRUE,include.lowest = TRUE)
ggplot(data = df, mapping = aes(x = price_freq)) +
  stat_count() +
  theme(axis.text.x = element_text(angle = 270))
  # + geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
  # stat_function(fun = dnorm, aes(color = 'Normal'),
  #               args = list(mean = mean_price, sd = sd_price)) +
  # scale_colour_manual(name = "Colors", values = c("red", "blue"))

Any ideas?

CodePudding user response:

Take a look at the PearsonDS package (I am guessing you are not using rnorm for a reason). The easiest approach may be to generate a vector of data that meets your requirements and map that vector using geom_line.

library("PearsonDS")
df <- rpearson(5000,moments=c(mean=10,variance=2,skewness=0,kurtosis=3))
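One way to read "map that vector using geom_line" is sketched below; it is an interpretation, not necessarily what the answerer had in mind. It assumes the simulated values should match the mean and variance of price, smooths them with density(), and draws the result over a density-scaled histogram. The names simulated, sim_kde, sim_curve and p_sim, and the seed, are made up for illustration. after_stat(density) is the newer ggplot2 spelling of ..density...

```r
library(ggplot2)
library(PearsonDS)

df <- ggplot2::diamonds

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# skewness = 0 and kurtosis = 3 describe a normal shape;
# change these moments to simulate a differently shaped reference curve
simulated <- rpearson(5000, moments = c(mean = mean(df$price),
                                        variance = var(df$price),
                                        skewness = 0, kurtosis = 3))

# Smooth the simulated vector and keep the curve as plain x/y columns
sim_kde <- density(simulated)
sim_curve <- data.frame(price = sim_kde$x, density = sim_kde$y)

p_sim <- ggplot(df, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 colour = "black", fill = "white") +
  geom_line(data = sim_curve, aes(x = price, y = density), colour = "blue")
print(p_sim)
```

Because the curve comes from a random sample, it will wobble slightly from run to run unless the seed is fixed.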

CodePudding user response:

Your problem is that cut gives you a factor/character for your x-axis. You need a numeric x-axis to add the other layers. A first step might be to try the following. I added a small fudge to get the last bin to work out.

library(tidyverse)
df <- ggplot2::diamonds

mean_price <- mean(df$price)
sd_price <- sd(df$price)

num_bins <- 40
breaks <- seq(from=min(df$price),to=max(df$price) + 1e-10,length.out=num_bins + 1)
midpoints <- (breaks[1:num_bins] + breaks[2:(num_bins + 1)])/2

precomputed <- df %>% 
    mutate(bin_left = breaks[findInterval(price, breaks)],
           bin_mid = midpoints[findInterval(price, breaks)]) %>%
    count(bin_mid) 

precomputed %>% 
    ggplot(aes(x = bin_mid, weight = n)) +
    geom_histogram(aes(y = ..density..), bins = num_bins, boundary = breaks[1], colour = "black", fill = "white") +
    geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
    stat_function(fun = dnorm, aes(color = 'Normal'),
                  args = list(mean = mean_price, sd = sd_price)) +
    scale_colour_manual(name = "Colors", values = c("red", "blue"))

But you will notice that the red Empirical curve is quite different from the one in your ggplot2 example. The reason is that here it is computed from the summary data, which moves every x-value to its bin midpoint. You will need to pre-compute this empirical curve from the raw data, or drop it and rely on the histogram to represent the data.
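A minimal sketch of the pre-computing route, assuming the raw prices are still available when the summary is built: the kernel density is estimated once from the raw data with density(), stored as a small data frame, and drawn over the pre-binned bars. The names bin_width, binned, kde, curves and p are made up for illustration, and the normal curve is evaluated with dnorm on the same grid instead of going through stat_function.

```r
library(ggplot2)

df <- ggplot2::diamonds
mean_price <- mean(df$price)
sd_price <- sd(df$price)

num_bins <- 40
# Same small fudge as above so the maximum price falls inside the last bin
breaks <- seq(from = min(df$price), to = max(df$price) + 1e-10,
              length.out = num_bins + 1)
bin_width <- diff(breaks)[1]
midpoints <- breaks[-1] - bin_width / 2

# Pre-binned summary: one row per bin, counts converted to a density
bin_index <- findInterval(df$price, breaks)
binned <- data.frame(bin_mid = midpoints,
                     n = tabulate(bin_index, nbins = num_bins))
binned$density <- binned$n / (sum(binned$n) * bin_width)

# Empirical curve estimated from the raw prices, not from the bin midpoints
kde <- density(df$price)
curves <- rbind(
  data.frame(price = kde$x, density = kde$y, curve = "Empirical"),
  data.frame(price = kde$x,
             density = dnorm(kde$x, mean = mean_price, sd = sd_price),
             curve = "Normal")
)

p <- ggplot() +
  geom_col(data = binned, aes(x = bin_mid, y = density),
           width = bin_width, colour = "black", fill = "white") +
  geom_line(data = curves, aes(x = price, y = density, colour = curve)) +
  scale_colour_manual(name = "Colors",
                      values = c(Empirical = "red", Normal = "blue"))
print(p)
```

To keep the y-axis in counts, as in the stat_count() plot from the question, plot n instead of density for the bars and multiply both curves by nrow(df) * bin_width.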

Sorry for the partial answer.
