Home > Software design >  Outlier Detection on common variable name in R
Outlier Detection on common variable name in R

Time:06-26

I've would like to create a column to detect outliers in a large dataset. A sample of the dataset is

Item Cost
Book A 5.0
Book A 3.5
Book A 12.0
Book B 6.0
Book B 8.0
Book C 3.0
Book B 6.0
Book C 3.5
Book A 3.8
Book B 13.0
Book A 5.1
Book B 7.0
Book A 11.5
Book C 3.8
Book A 5.5
Book A 6.5
Book B 13.5
Book A 5.5
Book C 3.5
Book A 1.2
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A", 
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12, 
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5, 
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))

I would like to detect the outlier (for e.g. if the cost is 60% higher or lower than majority average of the particular item) and output a column as follows

Item Cost Outlier (Y/N)
Book A 5.0 N
Book A 3.5 N
Book A 12.0 Y
Book B 6.0 N
Book B 8.0 N
Book C 3.0 N
Book B 6.0 N
Book C 3.5 N
Book A 3.8 N
Book B 13.0 Y
Book A 5.1 N
Book B 7.0 N
Book A 11.5 Y
Book C 3.8 N
Book A 5.5 N
Book A 6.5 N
Book B 13.5 Y
Book A 5.5 N
Book C 3.5 N
Book A 1.2 Y

Thanks in advance!

CodePudding user response:

library(dplyr)

df %>%
  summarise(outlier = mean(Cost), 
            offset = outlier * 0.6, 
            higher_value = outlier   offset, 
            lower_value = outlier - offset) %>%
  bind_cols(df) %>%
  transmute(Item, Cost, 
            Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))

#     Item Cost Outlier
#1  Book A  5.0       N
#2  Book A  3.5       N
#3  Book A 12.0       Y
#4  Book B  6.0       N
#5  Book B  8.0       N
#6  Book C  3.0       N
#7  Book B  6.0       N
#8  Book C  3.5       N
#9  Book A  3.8       N
#10 Book B 13.0       Y
#11 Book A  5.1       N
#12 Book B  7.0       N
#13 Book A 11.5       Y
#14 Book C  3.8       N
#15 Book A  5.5       N
#16 Book B  6.5       N
#17 Book B 13.5       Y
#18 Book A  5.5       N
#19 Book C  3.5       N
#20 Book A  1.2       Y

If you want to find the outlier for each Item separately.

df %>%
  group_by(Item) %>%
  summarise(outlier = mean(Cost), 
            offset = outlier * 0.6, 
            higher_value = outlier   offset, 
            lower_value = outlier - offset) %>%
  left_join(df, by = 'Item') %>%
  transmute(Item, Cost, 
            Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
  • Related