I am looking to plot density curves over the distribution of my data. I have heard (I am not an expert) of something called a lognormal distribution. My data is the following:
library(ggplot2)
data <- data.frame(
  Day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12))
ggplot(data) +
  geom_bar(aes(y = Variable, x = Day), stat = "identity", colour = "black") +
  labs(title = "", x = "", y = expression('Variable')) +
  theme_classic()
And would like something like this:
CodePudding user response:
I think you're getting a bit confused between a bar chart and a histogram. You have a bar chart, which in your case is showing the change in one variable on the y axis against time on the x axis.
A histogram looks like a bar chart, but it shows the distribution of a single variable throughout its range: the value of the variable is on the x axis, and the frequency with which that value was found is on the y axis. It makes sense to plot a density curve over a histogram, but not over a time series. If you are looking for a similar visual effect to the plot shown, the best you can get is probably to plot a smoothed trend line (similar in spirit to a moving average) along with the bars, perhaps something like this:
ggplot(data, aes(Day, Variable)) +
  geom_col(colour = "black") +
  geom_smooth(se = FALSE, color = "black") +
  labs(title = "", x = "", y = expression('Variable')) +
  theme_classic()
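For contrast, here is a minimal sketch of what a histogram with a density curve looks like when you do have many observations of a single variable. The data frame `sim` and its column `value` are simulated purely for illustration (they are not the data from the question), and the lognormal parameters are arbitrary assumptions.

```r
library(ggplot2)

# Hypothetical data: 200 simulated draws from a lognormal distribution
set.seed(1)
sim <- data.frame(value = rlnorm(200, meanlog = 2, sdlog = 0.5))

# Histogram scaled to density units, with a kernel density curve on top
p <- ggplot(sim, aes(value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 colour = "black", fill = "grey80") +
  geom_density() +
  theme_classic()
p
```

Here the x axis is the value of the variable and the y axis is how often values in each bin occur, which is why a density curve is meaningful on this plot.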
CodePudding user response:
I am assuming here that your data are frequencies. To simplify the subsequent operations, first transform them into a vector of single observations.
library(ggplot2)
library(tidyr)
library(dplyr)
long <- with(data, rep(Day, times = Variable))
long[1:20]
[1] 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
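As a quick check of the expansion step above, the sketch below confirms that tabulating the expanded vector gives back the original frequencies. It also shows `tidyr::uncount()` as one possible alternative that returns a data frame instead of a vector; that alternative is a suggestion here, not part of the original approach.

```r
library(tidyr)

data <- data.frame(
  Day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12))

# One element per observation: each Day repeated Variable times
long <- with(data, rep(Day, times = Variable))

# Tabulating the expanded vector should recover the original counts
counts <- table(long)
all(counts == data$Variable)

# Alternative: uncount() repeats each row according to the Variable column
long_df <- uncount(data, Variable)
nrow(long_df) == sum(data$Variable)
```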
Now, using ggplot, you can plot the histogram and density estimate:
data.frame(Day = long) |>
  ggplot() +
  geom_bar(aes(x = Day), stat = "count") +
  geom_density(aes(x = Day, y = after_stat(count)))
To fit a mixture of two gaussian distributions (as in the reference example), you can use mclust. Note that this is shown only as an example and it is probably not the best way to treat your data. Here, the parameter G specifies the number of mixture components:
library(mclust)
fit <- densityMclust(long, G = 2)
Next, the data are prepared for plotting using ggplot in several steps:
- the density estimates are produced for both gaussian models in a suitable range of x using predict
- the density values are scaled to match the proportion of observations predicted to be in either model
- the dataframe is rearranged into long format, pivoting on x
x <- seq(0, max(long) + 1, by = 0.1)
dens <- predict(fit, x, what = "cdens") |>
  apply(1, function(z) z * table(fit$classification)) |>
  t() |>
  data.frame() |>
  cbind(x = x) |>
  pivot_longer(cols = c(X1, X2),
               names_to = "K",
               values_to = "y")
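The scaling step can also be sketched with the fitted mixing proportions instead of the hard classification counts. This is a variation for illustration, not the method used above: it assumes the proportions are stored in `fit$parameters$pro`, and the names `cdens`, `scaled` and `total` are introduced here only for the example.

```r
library(mclust)

data <- data.frame(
  Day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Variable = c(3, 5, 20, 10, 8, 18, 23, 21, 16, 12))
long <- with(data, rep(Day, times = Variable))
fit <- densityMclust(long, G = 2)

x <- seq(0, max(long) + 1, by = 0.1)
n <- length(long)

# Component densities: one row per x value, one column per component
cdens <- predict(fit, x, what = "cdens")

# Scale each column by the expected number of observations in that component
scaled <- sweep(cdens, 2, n * fit$parameters$pro, `*`)

# Sanity check: the scaled components should add up to
# n times the overall mixture density
total <- rowSums(scaled)
all.equal(as.vector(total), as.vector(n * predict(fit, x, what = "dens")))
```

Using the mixing proportions gives curves whose sum matches the overall density estimate, whereas the classification counts give a close but not identical scaling.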
Finally, the densities are layered upon the barplot, by specifying in geom_line that the source of data is dens:
data.frame(Day = long) |>
  ggplot() +
  geom_bar(aes(x = Day), stat = "count") +
  geom_line(data = dens,
            aes(x = x, y = y, color = K))