Getting counts from geom_density()

Time:09-25

I have a series of numbers:

tmp<- c(round(seq(0, 12000, ((12000 - 0) / round(1500 * .05)))), 
                 round(seq(12000, 18900, ((18900 - 12000) / round(1500 * .1)))),
                 round(seq(18900, 23300, ((23300 - 18900) / round(1500 * .1)))),
                 round(seq(23300, 28100, ((28100 - 23300) / round(1500 * .1)))),
                 round(seq(28100, 33500, ((33500 - 28100) / round(1500 * .1)))),
                 round(seq(33500, 40000, ((40000 - 33500) / round(1500 * .1)))),
                 round(seq(40000, 47700, ((47700 - 40000) / round(1500 * .1)))),
                 round(seq(47700, 56500, ((56500 - 47700) / round(1500 * .1)))),
                 round(seq(56500, 68300, ((68300 - 56500) / round(1500 * .1)))),
                 round(seq(68300, 94200, ((94200 - 68300) / round(1500 * .1)))),
                 round(seq(94200, 200000, ((200000 - 94200) / round(1500 * .05)))))

Now I can use geom_density to get the shape of the distribution. How do I get the approximate number of values in tmp that fall between two specific values, based on that density shape?

For example, I could count the number of values in tmp between 10050 and 10100 from the actual series. But I would like to estimate that count from the smoothed histogram (the density), which is not as linear as the actual series.

CodePudding user response:

I'm not sure I've understood the question correctly. The following code estimates the number of values in 'tmp' within a range from the density estimate rather than from the actual values. The estimate is a probability density, so you have to multiply it:

  • by the width of each bin of the estimate (the spacing between evaluation points), to get the probability around each evaluation point

  • by the total number of values, to get an estimate of the number of values in a given range (in the example, 10000 to 20000 exclusive).
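
Written out, the two multiplications amount to a Riemann-sum approximation of the integral of the density over the range. The symbols here are my own notation, not taken from the R documentation: \hat{f} is the estimated density, x_i are the equally spaced evaluation points, \Delta x is their spacing, N is the number of values, and (a, b) is the range of interest.

```latex
\text{count}(a, b) \;\approx\; N \int_a^b \hat{f}(x)\,dx
\;\approx\; N \,\Delta x \sum_{i \,:\, a < x_i < b} \hat{f}(x_i)
```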

'density' is the function that is called by 'geom_density' to get the points to draw.

> k <- density(tmp); sum(k$y[which(k$x>10000 & k$x<20000)])*(k$x[2]-k$x[1])*length(tmp)
[1] 199.3722

> length(which(tmp>10000 & tmp<20000))
[1] 202
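
One caveat with the one-liner above: by default 'density' evaluates the estimate at only 512 points across the whole data range, so a narrow range such as 10050 to 10100 may contain few or no evaluation points. A minimal sketch of one way around this is to ask 'density' to evaluate the estimate only on the range of interest (its 'from', 'to' and 'n' arguments) and integrate with the trapezoid rule. The helper name 'density_count' is hypothetical, and this is just one possible approach, not what geom_density does internally.

```r
# Hypothetical helper: estimated number of values of `x` between `lo` and `hi`,
# based on the kernel density estimate rather than on the raw values.
density_count <- function(x, lo, hi, n = 512) {
  # Evaluate the density on n equally spaced points inside [lo, hi] only.
  # The bandwidth is still chosen from the full data set.
  k <- density(x, from = lo, to = hi, n = n)

  # Trapezoid rule: area under the density curve = probability of [lo, hi].
  m <- length(k$y)
  prob <- sum(diff(k$x) * (k$y[-1] + k$y[-m]) / 2)

  # Scale the probability by the sample size to get an estimated count.
  prob * length(x)
}
```

With the 'tmp' vector from the question, 'density_count(tmp, 10000, 20000)' should land close to the estimate shown above, and 'density_count(tmp, 10050, 10100)' gives an estimate for the narrow range in the question (typically a fraction of a value, since the range is so small).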