Computing densities for discrete data in R-CodePudding

I want to plot a density histogram in R with discrete data. Given that my dataset is very large I need to compute the densities prior plotting.

However, I find that using the stats::density function provides different results compared to ..density.. in ggplot2. Why is that? Also the ..density.. results from ggplot2 are in line with what is expected whereas stats::density are not.

See reproducible example below.

Many Thanks

library(tidyverse)
library(patchwork)

df <- data.frame(A = round(rnorm(1000)),
           B = round(rnorm(1000)),
           C = round(rnorm(1000))) %>% 
  pivot_longer(cols = everything(), names_to = "group")

dens_df <- df %>% 
  group_by(group) %>% 
  summarise(dens = list(density(value, from = -3, to = 6, n = length(-3:6)))) %>% #compute density and nest into list
  mutate(density.x = map(dens, ~.x[["x"]]), #extract x values
         density.y = map(dens, ~.x[["y"]])) %>%  #extract y values
  select(-dens) %>% 
  unnest(cols = c(density.x, density.y))

plot_dens <- dens_df %>% 
  ggplot() 
  aes(x = density.x, y = density.y) %>% 
  geom_col() 
  scale_x_continuous(breaks = seq(-3,10,1)) 
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red")  
  facet_wrap(~group) 
  labs(title = "using stats::density")

plot_geom <- df %>% 
  ggplot() 
  aes(x = value, y = ..density..) %>% 
  geom_histogram(
    binwidth = 1,
    col = "white") 
  scale_x_continuous(breaks = seq(-3,10,1)) 
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "point", size = 2, col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), geom = "line", col = "red")  
  facet_wrap(~group) 
  labs(title = "using ggplot2 ..density..")


plot_dens   plot_geom

CodePudding user response：

The simple answer is that stats::density isn't the correct function to use on discrete data. stats::density uses a smoothing kernel and will provide a curve representing the continuous density between -3 and 6, even if only sampled at 10 points. This is not the same as the densities of discrete data, which is just the frequency of each bin divided by the number of observations. This is what geom_histogram is drawing.

The computationally efficient way to get discrete densities is to use hist or even just table(x)/length(x).

Here's an example using precomputed densities from hist

dens_df <- df %>% 
  group_by(group) %>% 
  summarise(dens = list(hist(value, breaks = seq(-3.5, 6.5), plot = FALSE))) %>% 
  mutate(density.x = map(dens, ~.x[["mids"]]), #extract x values
         density.y = map(dens, ~.x[["density"]])) %>%  #extract y values
  select(-dens) %>% 
  unnest(cols = c(density.x, density.y))

We can see this leads to the same plot as geom_histogram, using your own plotting code but with the altered dens_df

plot_dens <- dens_df %>% 
  ggplot() 
  aes(x = density.x, y = density.y) %>% 
  geom_col() 
  scale_x_continuous(breaks = seq(-3,10,1)) 
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1),
                geom = "point", col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", size = 2, col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "line", col = "red")  
  facet_wrap(~group) 
  labs(title = "using graphics::hist")

plot_geom <- df %>% 
  ggplot() 
  aes(x = value, y = ..density..) %>% 
  geom_histogram(
    binwidth = 1,
    col = "white") 
  scale_x_continuous(breaks = seq(-3,10,1)) 
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "point", size = 2, col = "red")  
  stat_function(fun = dnorm, n = 10, args = list(mean = 0, sd = 1), 
                geom = "line", col = "red")  
  facet_wrap(~group) 
  labs(title = "using ggplot2 ..density..")

plot_dens   plot_geom