Statistical analysis of distributed data values in Java

Time:12-23

I am writing a program in Java that outputs a List<Double> of distances that roughly follow a bell curve distribution. From this data, I need to generate two values, A and B, that sit at a particular number of standard deviations from the mean X, one above the mean and one below it. The distribution may not be symmetrical, but I am content to assume that it is for my purposes. These values A and B would be better than my current method of taking the min and max of the dataset, which is very vulnerable to being skewed by random outliers and so is not always representative of a specific probability from the distribution. How would I generate these values, A and B? Should I be asking this in the Stats stack exchange? Any help is greatly appreciated!

Answer:

Should I be asking this in the Stats stack exchange?

Nah, we can do it here!

The Statistics

First off, we need to establish what we want to do. A and B are values on opposite sides of the mean, each a chosen number of standard deviations away from it.

  • Recall that the standard deviation is simply the square root of the variance.
  • The variance is calculated as sum((x[i] - mean)^2) / x.length.
  • Thus, we also need the mean, which is sum(x[i]) / x.length.

With the standard deviation calculated, multiplying it by 1 gives the distance from the mean to B, so B is the mean plus that distance. Use a negative multiplier (-1) for A, the value below the mean. To place A and B further out, use a larger multiplier, such as 2 for two standard deviations.
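
As a quick check of the formulas above, here is a worked example on a made-up data set (2, 4, 4, 4, 5, 5, 7, 9). The numbers are purely illustrative and not from the question, and the population form of the variance (dividing by the number of values) is assumed:

```latex
\begin{aligned}
\bar{x} &= \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \\
\sigma^2 &= \frac{9+1+1+1+0+0+4+16}{8} = \frac{32}{8} = 4 \\
\sigma &= \sqrt{4} = 2 \\
A &= \bar{x} - 1\cdot\sigma = 3, \qquad B = \bar{x} + 1\cdot\sigma = 7
\end{aligned}
```

So with a multiplier of 1, A and B for this data set would be 3 and 7, compared with a min and max of 2 and 9.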

The code

The question establishes that the data is held in a List<Double>, so the code below works with a List.

First we need to loop over the list of data; let's call that List x. I'm assuming it is already populated with data.

We also need some variables: double mean for the mean, double stdev for the standard deviation, and two helper variables to hold the running sums, double sqr_sum and double data_sum, both initialized to 0.

Now, we will compute the mean first:

// Add up every value in the list; a List is read with get(i), not x[i].
for (int i = 0; i < x.size(); i++) {
    data_sum += x.get(i);
}
mean = data_sum / x.size();

Now we have everything needed to calculate the sum of squared differences from the mean, and from that the variance. I will also define another variable for the variance, double data_var, to make it easier.

// Accumulate the squared difference of each value from the mean.
for (int i = 0; i < x.size(); i++) {
    sqr_sum += Math.pow(x.get(i) - mean, 2);
}
// Divide by x.size() if the data is a whole population,
// or by x.size() - 1 if it is a sample.
data_var = sqr_sum / x.size();

stdev = Math.sqrt(data_var);

... and there you have it! The standard deviation of the x data. If you want to get B (or A), you could simply use:

double dev_A = -1; // How many standard deviations from the mean we want A to be.
double dev_B = 1; // How many standard deviations from the mean we want B to be.

double a = dev_A * stdev + mean;
double b = dev_B * stdev + mean;
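
For convenience, the steps above could be gathered into one self-contained class along these lines. This is only a sketch: the class name SpreadBounds, the method names, the sample flag, and the data in main are all hypothetical and not part of the original question, and the list is assumed to be non-empty (with at least two values when sample is true).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; names and signatures are illustrative only.
final class SpreadBounds {

    // Arithmetic mean of the values; assumes x is non-empty.
    static double mean(List<Double> x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.size();
    }

    // Standard deviation. sample = false divides by n (population),
    // sample = true divides by n - 1 (sample data).
    static double stdev(List<Double> x, boolean sample) {
        double m = mean(x);
        double sqrSum = 0.0;
        for (double v : x) {
            sqrSum += (v - m) * (v - m);
        }
        int divisor = sample ? x.size() - 1 : x.size();
        return Math.sqrt(sqrSum / divisor);
    }

    // Returns {A, B}: the values k standard deviations below and above the mean.
    static double[] bounds(List<Double> x, double k, boolean sample) {
        double m = mean(x);
        double s = stdev(x, sample);
        return new double[] { m - k * s, m + k * s };
    }

    public static void main(String[] args) {
        // Made-up distances, used only to demonstrate the calls.
        List<Double> distances = Arrays.asList(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0);
        double[] ab = bounds(distances, 1.0, false);
        System.out.println("A = " + ab[0] + ", B = " + ab[1]);
    }
}
```

Passing k = 2.0 instead of 1.0 would place A and B two standard deviations from the mean, which covers more of a bell-shaped distribution.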

Hope this helps!
