I've would like to create a column to detect outliers in a large dataset. A sample of the dataset is
Item | Cost |
---|---|
Book A | 5.0 |
Book A | 3.5 |
Book A | 12.0 |
Book B | 6.0 |
Book B | 8.0 |
Book C | 3.0 |
Book B | 6.0 |
Book C | 3.5 |
Book A | 3.8 |
Book B | 13.0 |
Book A | 5.1 |
Book B | 7.0 |
Book A | 11.5 |
Book C | 3.8 |
Book A | 5.5 |
Book A | 6.5 |
Book B | 13.5 |
Book A | 5.5 |
Book C | 3.5 |
Book A | 1.2 |
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A",
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12,
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5,
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))
I would like to detect the outlier (for e.g. if the cost is 60% higher or lower than majority average of the particular item) and output a column as follows
Item | Cost | Outlier (Y/N) |
---|---|---|
Book A | 5.0 | N |
Book A | 3.5 | N |
Book A | 12.0 | Y |
Book B | 6.0 | N |
Book B | 8.0 | N |
Book C | 3.0 | N |
Book B | 6.0 | N |
Book C | 3.5 | N |
Book A | 3.8 | N |
Book B | 13.0 | Y |
Book A | 5.1 | N |
Book B | 7.0 | N |
Book A | 11.5 | Y |
Book C | 3.8 | N |
Book A | 5.5 | N |
Book A | 6.5 | N |
Book B | 13.5 | Y |
Book A | 5.5 | N |
Book C | 3.5 | N |
Book A | 1.2 | Y |
Thanks in advance!
CodePudding user response:
library(dplyr)
df %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier offset,
lower_value = outlier - offset) %>%
bind_cols(df) %>%
transmute(Item, Cost,
Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
# Item Cost Outlier
#1 Book A 5.0 N
#2 Book A 3.5 N
#3 Book A 12.0 Y
#4 Book B 6.0 N
#5 Book B 8.0 N
#6 Book C 3.0 N
#7 Book B 6.0 N
#8 Book C 3.5 N
#9 Book A 3.8 N
#10 Book B 13.0 Y
#11 Book A 5.1 N
#12 Book B 7.0 N
#13 Book A 11.5 Y
#14 Book C 3.8 N
#15 Book A 5.5 N
#16 Book B 6.5 N
#17 Book B 13.5 Y
#18 Book A 5.5 N
#19 Book C 3.5 N
#20 Book A 1.2 Y
If you want to find the outlier for each Item
separately.
df %>%
group_by(Item) %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier offset,
lower_value = outlier - offset) %>%
left_join(df, by = 'Item') %>%
transmute(Item, Cost,
Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))