Hi, I'm trying to write a simple script that finds outliers in datasets.
The class looks like this:
namespace App\Classes;

class PercentageOutlier {
    private int $percentage;
    private array $dataset;
    private ?string $key;
    private array $normalitiesDataset = [];
    private array $outlierDataset = [];

    // TODO: add error handling for a multidimensional array provided without a key
    // count($array) == count($array, COUNT_RECURSIVE)
    public function __construct(int $percentage, array $dataset, ?string $key = null) {
        $this->percentage = $percentage;
        $this->dataset = $dataset;
        $this->key = $key;
    }

    // Method to detect outlier tickers
    public function detectOutliers(): void
    {
        $average = $this->key === null
            ? array_sum($this->dataset) / count($this->dataset)
            : array_sum(array_column($this->dataset, $this->key)) / count($this->dataset);

        foreach ($this->dataset as $key => $data) {
            $rate = $this->key === null ? $data : $data[$this->key];
            // Calculate the percentage difference between the current datapoint and the average datapoint
            $percentageDifference = abs(($rate - $average) / $average * 100);
            if ($percentageDifference > $this->percentage) {
                $this->outlierDataset[$key] = $data;
            } else {
                $this->normalitiesDataset[$key] = $data;
            }
        }
    }

    public function hasOutliers(): bool
    {
        return $this->countOutliers() > 0;
    }

    public function countOutliers(): int
    {
        return count($this->outlierDataset);
    }

    public function hasNormalities(): bool
    {
        return $this->countNormalities() > 0;
    }

    public function countNormalities(): int
    {
        return count($this->normalitiesDataset);
    }

    public function getOutliers(): array
    {
        return $this->outlierDataset;
    }

    public function getNormalities(): array
    {
        return $this->normalitiesDataset;
    }
}
The script works very well for a dataset looking like this:
$dataset = [
    0 => 1303.38028120000000000000,
    1 => 1303.57226533330000000000,
    2 => 1303.47627326660000000000,
    3 => 1363.60716566840000000000,
    4 => 1781.57864604160000000000,
    5 => 1314.34978900860000000000,
];
It returns an outliers array that looks like this, which is correct:
array:1 [
4 => 1781.5786460416
]
However, if I provide a slightly modified dataset that has two outliers, at index 3 and 4:
$dataset = [
    0 => 1303.38028120000000000000,
    1 => 1303.57226533330000000000,
    2 => 1303.47627326660000000000,
    3 => 1763.60716566840000000000,
    4 => 1781.57864604160000000000,
    5 => 1314.34978900860000000000,
];
It returns the whole array as outliers, because the average is now too far away from all datapoints.
I'm calling the script with a percentage value of 10, like so:
$outlier = new PercentageOutlier(10, $dataset);
$outlier->detectOutliers();
$normalities = $outlier->getNormalities();
$outliers = $outlier->getOutliers();
echo 'we have normalities: '.$outlier->countNormalities();
echo 'we have outliers: '.$outlier->countOutliers();
dump($normalities);
dd($outliers);
Can someone point me in the right direction? I've been looking at things like the IQR and LOF algorithms, but I'm not enough of a math wiz to build something that comes close to working :)
CodePudding user response:
The approach with the best odds is to calculate each value's absolute difference from the median and to treat every value whose difference is greater than a certain threshold epsilon as an outlier. Unlike the mean, the median is barely shifted by a few extreme values. Calculating the median is easy:
/**
 * @param array $data Array of numeric data
 * @return float|int|false Median of the data, false if the array is empty
 */
public static function median(array $data)
{
    if (($count = count($data)) < 1) {
        return false;
    }
    sort($data, SORT_NUMERIC);
    $mid = (int)($count / 2);
    if ($count % 2) {
        return $data[$mid]; // odd count: the single middle element
    }
    return ($data[$mid] + $data[$mid - 1]) / 2; // even count: mean of the two middle elements
}
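To see why the median is a better center than the mean here, the following minimal sketch compares both on the second dataset from the question. The variable names are only illustrative, and the median is computed inline so the snippet runs on its own.

```php
<?php
// Second dataset from the question (two outliers, at index 3 and 4).
$dataset = [
    0 => 1303.3802812,
    1 => 1303.5722653333,
    2 => 1303.4762732666,
    3 => 1763.6071656684,
    4 => 1781.5786460416,
    5 => 1314.3497890086,
];

// Mean: pulled upward by the two large values.
$mean = array_sum($dataset) / count($dataset);

// Median: middle of the sorted values (average of the two middle ones for an even count).
$sorted = array_values($dataset);
sort($sorted, SORT_NUMERIC);
$count = count($sorted);
$mid = intdiv($count, 2);
$median = $count % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

// Percentage deviation of every value from each center.
foreach ($dataset as $key => $value) {
    $fromMean = abs($value - $mean) / $mean * 100;
    $fromMedian = abs($value - $median) / $median * 100;
    printf("%d: %.2f%% from mean, %.2f%% from median\n", $key, $fromMean, $fromMedian);
}
```

On this dataset the mean is roughly 1461.66, so even the four "normal" values sit a little more than 10% away from it, while their distance from the median stays below 1%.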
Choosing a good value for epsilon is more difficult. A hand-picked value often still works for artificial datasets. Example for the first dataset:
$median = 1308.961027171;
$eps = 100;
$outliers = [];
foreach ($dataset as $key => $value) {
    if (abs($value - $median) > $eps) {
        $outliers[$key] = $value;
    }
}
// $outliers: array(1) { [4]=> float(1781.5786460416) }
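If the percentage semantics from the question should be kept, the same check can be expressed relative to the median instead of as an absolute epsilon. This is only a sketch: `medianPercentageOutliers` is a hypothetical helper, not part of the original class, and it assumes a flat array of numbers with a non-zero median.

```php
<?php
/**
 * Hypothetical helper: returns the values whose deviation from the median
 * exceeds $percentage percent of the median. Keys are preserved.
 */
function medianPercentageOutliers(array $data, float $percentage): array
{
    if (count($data) === 0) {
        return [];
    }

    // Median of the sorted values.
    $sorted = array_values($data);
    sort($sorted, SORT_NUMERIC);
    $count = count($sorted);
    $mid = intdiv($count, 2);
    $median = $count % 2 ? $sorted[$mid] : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

    // Assumption: a relative threshold is meaningless when the median is zero.
    if ($median == 0) {
        throw new InvalidArgumentException('Median is zero; use an absolute threshold instead.');
    }

    $outliers = [];
    foreach ($data as $key => $value) {
        if (abs($value - $median) / abs($median) * 100 > $percentage) {
            $outliers[$key] = $value;
        }
    }
    return $outliers;
}

// Second dataset from the question.
$second = [1303.3802812, 1303.5722653333, 1303.4762732666,
           1763.6071656684, 1781.5786460416, 1314.3497890086];

print_r(medianPercentageOutliers($second, 10)); // keys 3 and 4 are flagged
```

With a threshold of 10, this flags only index 4 on the first dataset and indexes 3 and 4 on the second one, because the median is the same for both.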
The problems arise when the algorithm is applied to real series of measurements.
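One possible way to avoid hand-picking epsilon is to derive it from the data, for example as a multiple of the median absolute deviation (MAD). This is a different technique from the fixed epsilon above, sketched here under assumptions: the function names `medianOf` and `madOutliers` and the multiplier `$k = 3.0` are illustrative choices, not something taken from the question.

```php
<?php
/** Median of a non-empty numeric array. */
function medianOf(array $values): float
{
    sort($values, SORT_NUMERIC);
    $count = count($values);
    $mid = intdiv($count, 2);
    return $count % 2 ? (float)$values[$mid] : ($values[$mid - 1] + $values[$mid]) / 2;
}

/**
 * Hypothetical helper: flags values whose absolute deviation from the median
 * exceeds $k times the median absolute deviation (MAD). Keys are preserved.
 */
function madOutliers(array $data, float $k = 3.0): array
{
    if (count($data) === 0) {
        return [];
    }

    $median = medianOf($data);

    // Absolute deviation of every value from the median (array_map keeps the keys here).
    $deviations = array_map(fn($value) => abs($value - $median), $data);

    // Epsilon is derived from the data. Note: if more than half of the values
    // are identical, the MAD is 0 and every differing value gets flagged.
    $eps = $k * medianOf($deviations);

    $outliers = [];
    foreach ($data as $key => $value) {
        if ($deviations[$key] > $eps) {
            $outliers[$key] = $value;
        }
    }
    return $outliers;
}

// Second dataset from the question.
$second = [1303.3802812, 1303.5722653333, 1303.4762732666,
           1763.6071656684, 1781.5786460416, 1314.3497890086];

print_r(madOutliers($second)); // keys 3 and 4 are flagged
```

A data-driven threshold can be stricter than a fixed one: on the first dataset this rule flags index 3 (1363.607...) as well as index 4, so the multiplier still has to be tuned to what counts as an outlier in the actual measurements.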