How do you plot density curves on top of the data to show distribution in R

Time:02-11

I am looking to plot density curves on top of my data to show its distribution. I have heard (I am not an expert) of something called the lognormal distribution. My data are the following:

data<- data.frame(
      Day=c(1,2,3,4,5,6,7,8,9,10),
      Variable=c(3,5,20,10,8,18,23,21,16,12))

ggplot(data) +
  geom_bar(aes(y = Variable, x = Day), stat = "identity", colour = "black") +
  labs(title = "", x = "", y = expression('Variable')) +
  theme_classic()


And I would like something like this: [image]

CodePudding user response:

I think you're getting a bit confused between a bar chart and a histogram. You have a bar chart, which in your case is showing the change in one variable on the y axis against time on the x axis.

A histogram looks like a bar chart, but it shows the distribution of a single variable throughout its range. It has the value of the variable on the x axis, and the frequency with which each value was observed on the y axis. It makes sense to plot a density curve over a histogram, but not over a time series. If you are looking for a similar visual effect to the plot shown, the best you can get is probably to plot a smoothed trend along with the bars, similar in spirit to a moving average (geom_smooth fits a loess curve by default for a data set this small), perhaps something like this:

ggplot(data, aes(Day, Variable)) +
  geom_col(colour = "black") +
  geom_smooth(se = FALSE, color = "black") +
  labs(title = "", x = "", y = expression('Variable')) +
  theme_classic()
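
For contrast, here is a minimal sketch of what a histogram with a density curve would look like if the Variable values were treated as a sample of one variable, ignoring Day. That reading of the data is an assumption made purely for illustration, and the bin width of 5 is an arbitrary choice:

```r
library(ggplot2)

# Data from the question
data <- data.frame(
  Day = 1:10,
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12)
)

# Histogram of Variable on the density scale, so that the kernel
# density estimate drawn on top shares the same y axis
p <- ggplot(data, aes(x = Variable)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 colour = "black", fill = "grey80") +
  geom_density() +
  theme_classic()

p
```

With only ten values the density estimate is very rough, so this is mainly useful to see how the two plot types differ.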


CodePudding user response:

I am assuming here that your data are frequencies. To simplify subsequent operations, first transform them into a vector of single observations.

library(ggplot2)
library(tidyr)
library(dplyr)


long <- with(data, rep(Day, times = Variable))

long[1:20]

[1] 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
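
As a quick sanity check on this expansion (a sketch that redefines the data so it runs on its own), the length of the vector should equal the total of the frequencies, and tabulating it should give back the original counts:

```r
# Data from the question
data <- data.frame(
  Day = 1:10,
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12)
)

# One entry per observation: each Day repeated Variable times
long <- with(data, rep(Day, times = Variable))

# Total number of observations equals the sum of the frequencies
length(long) == sum(data$Variable)

# Counting the entries per Day recovers the Variable column
counts <- as.vector(table(long))
all(counts == data$Variable)
```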

Now using ggplot you can plot the histogram and density estimate:

data.frame(Day = long) |>
  ggplot() +
  geom_bar(aes(x = Day), stat = "count") +
  geom_density(aes(x = Day, y = after_stat(count)))

UNIMODAL DENSITY

To fit a mixture of two Gaussian distributions (as in the reference example), you can use mclust. Note that this is shown only as an example and it is probably not the best way to treat your data. Here, the parameter G specifies the number of mixture components:

library(mclust)
fit <- densityMclust(long, G = 2)
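
To see what was fitted, you can look inside the returned object. This is a small sketch that rebuilds the data so it runs on its own; the exact estimates depend on the model mclust selects, so no values are shown here:

```r
library(mclust)

# Data from the question, expanded to one entry per observation
data <- data.frame(
  Day = 1:10,
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12)
)
long <- with(data, rep(Day, times = Variable))

fit <- densityMclust(long, G = 2)

fit$G                       # number of mixture components
fit$parameters$pro          # mixing proportions (these sum to 1)
fit$parameters$mean         # mean of each component
table(fit$classification)   # observations assigned to each component
```

The counts from table(fit$classification) are what the plotting step uses to scale each component density.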

Next, the data are prepared for plotting using ggplot in several steps:

  • the density estimates are produced for both gaussian models in a suitable range of x using predict
  • the density values are scaled to match the proportion of observations predicted to be in either model
  • the dataframe is rearranged pivoting on the x

x <- seq(0, max(long) + 1, by = 0.1)

dens <- predict(fit, x, what = "cdens") |>
  apply(1, function(z) z*table(fit$classification)) |>
  t() |>
  data.frame() |>
  cbind(x = x) |>
  pivot_longer(cols = c(X1, X2),
               names_to = "K",
               values_to = "y")

Finally, the densities are layered upon the barplot, by specifying in geom_line that the source of data is dens:

data.frame(Day = long) |>
  ggplot() +
  geom_bar(aes(x = Day), stat = "count") +
  geom_line(data = dens,
            aes(x = x, y = y, color = K))

BIMODAL DENSITY
