I have an ordered data set of decimal numbers. This data is always similar - but not always the same. The expected data is a few, 0 - 5 large numbers, followed by several (10 - 90) average numbers then follow by smaller numbers. There are cases where a large number may be mixed into the average numbers' See the following arrays.
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
let expectedData = [35.267,32.267,9.267,9.332,9.186,9.220,9.141,9.107,30.267,9.114,9.098,9.181,9.220,4.012,0.132];
I am trying to analyze the data by getting the average without high numbers on front and low numbers on back. The middle high/low are fine to keep in the average. I have a partial solution below. Right now I am sort of brute forcing it but the solution isn't perfect. On smaller datasets the first average calculation is influenced by the large number.
My question is: Is there a way to handle this type of problem, which is identifying patterns in an array of numbers?
My algorithm is:
- Get an average of the array
- Calculate an above/below average value
- Remove front (n) elements that are above average
- remove end elements that are below average
- Recalculate average
In JavaScript I have: (this is partial leaving out below average)
let total= expectedData.reduce((rt,cur)=> {return rt cur;}, 0);
let avg = total/expectedData.length;
let aboveAvg = avg*0.1 avg;
let remove = -1;
for(let k=0;k<expectedData.length;k ) {
if(expectedData[k] > aboveAvg) {
remove=k;
} else {
if(k==0) {
remove = -1;//no need to remove
}
//break because we don't want large values from middle removed.
break;
}
}
if(remove >= 0 ) {
//remove front above average
expectedData.splice(0,remove 1);
}
//remove belows
//recalculate average
CodePudding user response:
I believe you are looking for some outlier detection Algorithm. There are already a bunch of questions related to this on Stack overflow.
However, each outlier detection algorithm has its own merits.
Here are a few of them
https://mathworld.wolfram.com/Outlier.html
High outliers are anything beyond the 3rd quartile 1.5 * the inter-quartile range (IQR)
Low outliers are anything beneath the 1st quartile - 1.5 * IQR
Grubbs's test
- You can check how it works for your expectations here
Apart from these 2, the is a comparison calculator here . You can visit this to use other Algorithms per your need.
CodePudding user response:
I would have tried to get a sliding window coupled with an hysteresis / band filter in order to detect the high value peaks, first.
Then, when your sliding windows advance, you can add the previous first value (which is now the last of analyzed values) to the global sum, and add 1 to the number of total values.
When you encounter a peak (=something that causes the hysteresis to move or overflow the band filter), you either remove the values (may be costly), or better, you set the value to NaN
so you can safely ignore it.
You should keep computing a sliding average within your sliding window in order to be able to auto-correct the hysteresis/band filter, so it will reject only the start values of a peak (the end values are the start values of the next one), but once values are stabilized to a new level, values will be kept again.
The size of the sliding window will set how much consecutive "stable" values are needed to be kept, or in other words how much UNstable values are rejected when you reach a new level.
CodePudding user response:
For that, you can check the mode of the values (rounded) and then take all the numbers in a certain range around the mode. That range can be taken from the data itself, for example by taking the 10%
of the max - min
value. That helps you to filter your data. You can select the percent that fits your needs. Something like this:
let expectedData = [35.267,9.267,9.332,9.186,9.220,9.141,9.107,9.114,9.098,9.181,9.220,4.012,0.132];
expectedData.sort((a, b) => a - b);
/// Get the range of the data
const RANGE = expectedData[ expectedData.length - 1 ] - expectedData[0];
const WINDOW = 0.1; /// Window of selection 10% from left and right
/// Frequency of each number
let dist = expectedData.reduce((acc, e) => (acc[ Math.floor(e) ] = (acc[ Math.floor(e) ] || 0) 1, acc), {});
let mode = Object.entries(dist).sort((a, b) => b[1] - a[1])[0][0];
let newData = expectedData.filter(e => mode - RANGE * WINDOW <= e && e <= mode RANGE * WINDOW);
console.log(newData);