Subsetting a named numeric for top N values in R

Time:04-06

I have a large numeric object which is the output of an isolation forest model.

I want to subset the output of the model to find the top N outliers. Using example code I found, I can identify the single top outlier, but I want to retrieve more than one.

My data looks as follows:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(isotree)

set.seed(1) 

m <- 100 

n <- 2 

X <- matrix(rnorm(m * n), nrow = m)

# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))

# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)

# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)

The row with the highest outlier score can be extracted as follows:

X[which.max(pred), ]

dplyr::slice_max doesn't appear to be compatible with my large numeric object.

Any suggestions that would allow me to subset my data to find the top N outliers would be greatly appreciated.

Answer:

Does this solve your problem?

library(tidyverse)
#install.packages("isotree")
library(isotree)

set.seed(1) 

m <- 100 

n <- 2 

X <- matrix(rnorm(m * n), nrow = m)

# ADD CLEAR OUTLIER TO THE DATA
X <- rbind(X, c(3, 3))

# TRAIN AN ISOLATION FOREST MODEL
iso <- isolation.forest(X, ntrees = 10, nthreads = 1)

# MAKE A PREDICTION TO SCORE EACH ROW
pred <- predict(iso, X)

X[which.max(pred), ]
#> [1] 3 3

# Perhaps this?
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = pred, n = 3)
#>          X1         X2      pred
#> 1  3.000000  3.0000000 0.7306871
#> 2 -1.523567 -1.4672500 0.6496666
#> 3 -2.214700 -0.6506964 0.5982211

# Or maybe this? (top 3 rows by the first column, X1, rather than by outlier score)
data.frame(X, "pred" = pred) %>%
  slice_max(order_by = X1, n = 3)
#>         X1        X2      pred
#> 1 3.000000 3.0000000 0.7306871
#> 2 2.401618 0.4251004 0.5014570
#> 3 2.172612 0.2075383 0.4811756

Created on 2022-04-06 by the reprex package (v2.0.1)
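
If you would rather stay in base R and keep the scores as a plain (optionally named) numeric vector, one possible approach is to sort the positions with order() and keep the first N. The sketch below is only an illustration: `scores` and `dat` are made-up stand-ins for `pred` and `X`, and `top_n` is a hypothetical name for N.

```r
# Hypothetical stand-in for the model scores: a named numeric vector.
scores <- c(a = 0.42, b = 0.73, c = 0.48, d = 0.65, e = 0.60)

# Hypothetical stand-in for the data: one row per score, in the same order.
dat <- matrix(seq_len(10), nrow = 5,
              dimnames = list(names(scores), c("V1", "V2")))

top_n <- 3

# order() returns positions sorted by score; decreasing = TRUE puts the
# highest scores first, and head() keeps the first top_n positions.
top_idx <- head(order(scores, decreasing = TRUE), top_n)

# Indexing by position keeps the names on the score vector.
top_scores <- scores[top_idx]

# drop = FALSE keeps the result a matrix even when top_n is 1.
top_rows <- dat[top_idx, , drop = FALSE]

print(top_scores)
print(top_rows)
```

Applied to the objects in the question, the same idea would be `X[head(order(pred, decreasing = TRUE), 3), ]`. Ties are returned in their original order, so check for them if the exact cut-off matters.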
