I have a large numeric object that is the output of an isolation forest model.
I want to subset the model output to find the top N outliers. Using the example code from here, I can find the top outlier, but I want to find more than one.
My data looks as follows:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(isotree)
set.seed(1)
m <- 100
n <- 2
X <- matrix(rnorm(m * n), nrow = m)
# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))
# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)
The row with the highest outlier score can be extracted as follows:
X[which.max(pred), ]
dplyr::slice_max doesn't appear to be compatible with my large numeric object.
Any suggestions that would allow me to subset my data to find the top N outliers would be greatly appreciated.
Answer:
Does this solve your problem?
library(tidyverse)
#install.packages("isotree")
library(isotree)
set.seed(1)
m <- 100
n <- 2
X <- matrix(rnorm(m * n), nrow = m)
# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))
# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)
# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)
X[which.max(pred), ]
#> [1] 3 3
# Perhaps this? (top 3 rows by outlier score)
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = pred, n = 3)
#>          X1         X2      pred
#> 1  3.000000  3.0000000 0.7306871
#> 2 -1.523567 -1.4672500 0.6496666
#> 3 -2.214700 -0.6506964 0.5982211
# Or maybe this? (top 3 rows by the first column, X1, rather than by score)
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = X1, n = 3)
#>         X1        X2      pred
#> 1 3.000000 3.0000000 0.7306871
#> 2 2.401618 0.4251004 0.5014570
#> 3 2.172612 0.2075383 0.4811756
Created on 2022-04-06 by the reprex package (v2.0.1)
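If you would rather stay in base R, the same idea can be sketched with order(), assuming pred is a plain numeric vector with one score per row of X. The helper name top_n_outliers is hypothetical (it is not part of isotree or dplyr), and the toy score below only stands in for pred so that the snippet runs without isotree installed:

```r
# Hypothetical helper: return the n rows of X with the highest scores,
# together with their original row indices and scores.
top_n_outliers <- function(X, scores, n = 3) {
  stopifnot(length(scores) == nrow(X))
  # order() with decreasing = TRUE puts the highest scores first
  idx <- head(order(scores, decreasing = TRUE), n)
  data.frame(row = idx, X[idx, , drop = FALSE], score = scores[idx])
}

# Stand-in data so the sketch runs on its own;
# with the objects from the question, pass X and pred instead.
set.seed(1)
X_demo <- rbind(matrix(rnorm(200), nrow = 100), c(3, 3))
score_demo <- sqrt(rowSums(X_demo^2))  # toy score: distance from the origin
top <- top_n_outliers(X_demo, score_demo, n = 3)
print(top)
```

With the question's objects this would be called as top_n_outliers(X, pred, n = 3). Keeping the row index in the result makes it easy to map the outliers back to the original data.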