Percentage Outlier Detection

Time:12-26

Hi, I'm trying to make a simple script that finds outliers in datasets.

The class looks like this:

namespace App\Classes;

class PercentageOutlier {
    private int $percentage;
    private array $dataset;
    private ?string $key;

    private array $normalitiesDataset = [];
    private array $outlierDataset = [];

    // todo - add some error handling for a multidimensional array provided without a key
    // count($array) == count($array, COUNT_RECURSIVE)
    public function __construct(int $percentage, array $dataset, ?string $key = null) {
        $this->percentage = $percentage;
        $this->dataset = $dataset;
        $this->key = $key;
    }

    // Method to detect outlier tickers
    public function detectOutliers(): void
    {
        $average = $this->key === null
             ? array_sum($this->dataset) / count($this->dataset)
             : array_sum(array_column($this->dataset, $this->key)) / count($this->dataset);


        foreach ($this->dataset as $key => $data) {
            $rate = $this->key === null ? $data : $data[$this->key];

            // Calculate the percentage difference between the current datapoint and the average datapoint
            $percentageDifference = abs(($rate - $average) / $average * 100);

            if ($percentageDifference > $this->percentage) {
                $this->outlierDataset[$key] = $data;
            } else {
                $this->normalitiesDataset[$key] = $data;
            }
        }
    }

    public function hasOutliers(): bool
    {
        return $this->countOutliers() > 0;
    }

    public function countOutliers(): int
    {
        return count($this->outlierDataset);
    }

    public function hasNormalities(): bool
    {
        return $this->countNormalities() > 0;
    }

    public function countNormalities(): int
    {
        return count($this->normalitiesDataset);
    }

    public function getOutliers(): array
    {
        return $this->outlierDataset;
    }

    public function getNormalities(): array
    {
        return $this->normalitiesDataset;
    }
}

The script works very well for a dataset looking like this:

$dataset = [
    0 => 1303.38028120000000000000,
    1 => 1303.57226533330000000000,
    2 => 1303.47627326660000000000,
    3 => 1363.60716566840000000000,
    4 => 1781.57864604160000000000,
    5 => 1314.34978900860000000000,
];

It returns an outliers array that looks like this, which is correct:

array:1 [
  4 => 1781.5786460416
]

However, if I provide a slightly modified dataset that has two outliers, at indexes 3 and 4:

$dataset = [
    0 => 1303.38028120000000000000,
    1 => 1303.57226533330000000000,
    2 => 1303.47627326660000000000,
    3 => 1763.60716566840000000000,
    4 => 1781.57864604160000000000,
    5 => 1314.34978900860000000000,
];

It returns the whole array as outliers, because the average is now too far away from every data point.

I'm calling the script with a percentage value of 10, like so:

$outlier = new PercentageOutlier(10, $dataset);
$outlier->detectOutliers();

$normalities = $outlier->getNormalities();
$outliers = $outlier->getOutliers();

echo 'we have normalities: '.$outlier->countNormalities();
echo 'we have outliers: '.$outlier->countOutliers();

dump($normalities);
dd($outliers);
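
For reference, the arithmetic behind that result can be checked with a short standalone snippet. This is only a sketch outside the class: it reuses the formula from detectOutliers() on the second dataset with the 10% threshold, and the variable names `$deviations` and `$flagged` are made up for illustration.

```php
<?php
// Standalone check of the mean-based test on the second dataset (threshold: 10 %).
$dataset = [
    0 => 1303.3802812,
    1 => 1303.5722653333,
    2 => 1303.4762732666,
    3 => 1763.6071656684,
    4 => 1781.5786460416,
    5 => 1314.3497890086,
];

// The two large values pull the mean up to roughly 1461.66.
$mean = array_sum($dataset) / count($dataset);

$deviations = [];
foreach ($dataset as $key => $value) {
    // Same formula as in detectOutliers()
    $deviations[$key] = abs(($value - $mean) / $mean * 100);
}

// Keep only the deviations above the 10 % threshold.
$flagged = array_filter($deviations, fn (float $d): bool => $d > 10);

printf("mean = %.2f, flagged %d of %d\n", $mean, count($flagged), count($dataset));
// prints: mean = 1461.66, flagged 6 of 6
```

The four values around 1303 to 1314 each end up slightly more than 10% below the mean, so they are flagged together with the two real outliers.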

Can someone point me in the right direction? I've been looking at things like the IQR and LOF algorithms, but I don't seem to be enough of a math whiz to build something that comes close to working :)

CodePudding user response:

The approach with the best odds computes the absolute difference of each value from the median and removes all values whose difference is greater than a chosen threshold, epsilon. Unlike the mean, the median is barely shifted by a few extreme values. Calculating the median is easy:

  /**
   * @param  array $data Array of numeric data
   * @return float|int|false Median of the data, false if the array is empty
   */
  public static function median(array $data)
  {
    if (($count = count($data)) < 1) return false;
    sort($data, SORT_NUMERIC);
    $mid = (int)($count / 2);
    if ($count % 2) return $data[$mid];
    return ($data[$mid] + $data[$mid - 1]) / 2;
  }

Choosing a good value for epsilon is more difficult. A hand-picked value often still works for artificial datasets. Example for the first dataset:

$median = 1308.961027171;
$eps = 100;
$outliers = [];
foreach($dataset as $key => $value){
  if(abs($value - $median) > $eps) {
     $outliers[$key] = $value;
  }
}
//$outliers: array(1) { [4]=> float(1781.5786460416) }
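
The same idea can be combined with the percentage threshold from the question. The following is a minimal sketch, not a drop-in replacement for the class: `medianOf()` and `outliersByMedian()` are hypothetical helper names, and the code assumes a flat, non-empty array of numbers with a non-zero median.

```php
<?php
// Hypothetical helper: median of a non-empty list of numbers.
function medianOf(array $values): float
{
    if ($values === []) {
        throw new InvalidArgumentException('Dataset must not be empty.');
    }
    sort($values, SORT_NUMERIC); // sorts a copy and reindexes it from 0
    $count = count($values);
    $mid = intdiv($count, 2);

    return $count % 2
        ? (float) $values[$mid]
        : ($values[$mid - 1] + $values[$mid]) / 2;
}

// Hypothetical helper: entries deviating from the median by more than
// $percentage percent. Original keys are preserved.
function outliersByMedian(array $dataset, float $percentage): array
{
    $median = medianOf($dataset); // assumed to be non-zero

    return array_filter(
        $dataset,
        fn ($value): bool => abs(($value - $median) / $median * 100) > $percentage
    );
}

// Second dataset from the question (two outliers).
$dataset = [
    0 => 1303.3802812,
    1 => 1303.5722653333,
    2 => 1303.4762732666,
    3 => 1763.6071656684,
    4 => 1781.5786460416,
    5 => 1314.3497890086,
];

$outliers = outliersByMedian($dataset, 10);
print_r(array_keys($outliers)); // keys 3 and 4
```

Because the median of both datasets in the question is about 1308.96, the same call on the first dataset flags only index 4.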

Problems arise when the algorithm is applied to real series of measurements.
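
One common way to avoid hand-picking epsilon is to derive it from the data, for example from the median absolute deviation (MAD). This is a different technique from the fixed epsilon above, shown only as a sketch: `medianValue()` and `outliersByMad()` are hypothetical names, and the factor of 3 is an assumed tuning value, not a rule.

```php
<?php
// Hypothetical helper: median of a non-empty list of numbers.
function medianValue(array $values): float
{
    if ($values === []) {
        throw new InvalidArgumentException('Dataset must not be empty.');
    }
    sort($values, SORT_NUMERIC); // sorts a copy and reindexes it from 0
    $count = count($values);
    $mid = intdiv($count, 2);

    return $count % 2
        ? (float) $values[$mid]
        : ($values[$mid - 1] + $values[$mid]) / 2;
}

// Hypothetical helper: entries further than $factor * MAD from the median.
// $factor = 3.0 is an assumption; it needs tuning for real data.
function outliersByMad(array $dataset, float $factor = 3.0): array
{
    $median = medianValue($dataset);

    // Absolute deviation of every value from the median.
    $deviations = array_map(fn ($value): float => abs($value - $median), $dataset);

    // MAD = median of those deviations; epsilon is a multiple of it.
    $epsilon = $factor * medianValue($deviations);

    return array_filter($dataset, fn ($value): bool => abs($value - $median) > $epsilon);
}

// Second dataset from the question (two outliers).
$dataset = [
    0 => 1303.3802812,
    1 => 1303.5722653333,
    2 => 1303.4762732666,
    3 => 1763.6071656684,
    4 => 1781.5786460416,
    5 => 1314.3497890086,
];

print_r(array_keys(outliersByMad($dataset))); // keys 3 and 4
```

The choice of factor still matters. On the first dataset this sketch, with a factor of 3, also flags index 3 (1363.60...), which the question treats as normal. If more than half of the values are identical, the MAD is 0 and every differing value is flagged.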
