I have the following dataset:
structure(list(media = c(56.257, NA, NA,
56.256, 56.267, NA, NA, 56.265, 56.262, 56.259, 56.265,
56.263, 56.265, 56.262, 56.264, NaN, 56.268, 56.265, 560.26,
5600.2, 56.265, 56.262, 56.263, 56.264, 56.263, 56.265, 56.271,
56.271, 56.28, 56.281), cycle_rounded = structure(c(1604487600,
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600,
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600,
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600,
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600,
1604487600, 1604487600, 1604487600, 1604489400, 1604489400), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -30L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x000001d9d6a21f90>)
I would like to 1) Get values in media
variable, above and below the IQR (interquartile range - outliers) of all values in media
and convert them to NaN. 2) Convert NA to NaN, because when I will plot geom_line()
it automatically gives me a break in the line, that is what I need.
I try something like it: mutate(media = case_when(media > 1.5*IQR(media, na.rm = T) ~ NaN))
But won't works.
Thanks for appreciate all help.
CodePudding user response:
IRQ
is a range, so a difference of values. I think you need to include the median as well, from which the IQR is relevant.
For this code, I generally would use the IQR for both above and below the median of the values. If you really need to remove just above it, then you can remove the between(..)
logic and simplify it.
data.table
DT[
with(list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
!between(media, med - 1.5*iq, med 1.5*iq)), media := NaN]
DT
# media cycle_rounded
# <num> <POSc>
# 1: NaN 2020-11-04 11:00:00
# 2: NA 2020-11-04 11:00:00
# 3: NA 2020-11-04 11:00:00
# 4: NaN 2020-11-04 11:00:00
# 5: 56.267 2020-11-04 11:00:00
# 6: NA 2020-11-04 11:00:00
# 7: NA 2020-11-04 11:00:00
# 8: 56.265 2020-11-04 11:00:00
# 9: 56.262 2020-11-04 11:00:00
# 10: 56.259 2020-11-04 11:00:00
# ---
# 21: 56.265 2020-11-04 11:00:00
# 22: 56.262 2020-11-04 11:00:00
# 23: 56.263 2020-11-04 11:00:00
# 24: 56.264 2020-11-04 11:00:00
# 25: 56.263 2020-11-04 11:00:00
# 26: 56.265 2020-11-04 11:00:00
# 27: 56.271 2020-11-04 11:00:00
# 28: 56.271 2020-11-04 11:00:00
# 29: NaN 2020-11-04 11:30:00
# 30: NaN 2020-11-04 11:30:00
dplyr
DT %>%
mutate(
media = if_else(
between(media,
median(media, na.rm = TRUE) - 1.5*IQR(media, na.rm = TRUE),
median(media, na.rm = TRUE) 1.5*IQR(media, na.rm = TRUE)),
media, NaN)
)
Calculating the median and IQR multiple times seems a bit unnecessary, so we can preassign outside the call. However, if you're ever dealing with grouping, then that won't work, so here's an alternative:
DT %>%
mutate(
media = with(
list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
if_else(between(media, med - 1.5*iq, med 1.5*iq), media, NaN))
)
# media cycle_rounded
# <num> <POSc>
# 1: NaN 2020-11-04 11:00:00
# 2: NA 2020-11-04 11:00:00
# 3: NA 2020-11-04 11:00:00
# 4: NaN 2020-11-04 11:00:00
# 5: 56.267 2020-11-04 11:00:00
# 6: NA 2020-11-04 11:00:00
# 7: NA 2020-11-04 11:00:00
# 8: 56.265 2020-11-04 11:00:00
# 9: 56.262 2020-11-04 11:00:00
# 10: 56.259 2020-11-04 11:00:00
# ---
# 21: 56.265 2020-11-04 11:00:00
# 22: 56.262 2020-11-04 11:00:00
# 23: 56.263 2020-11-04 11:00:00
# 24: 56.264 2020-11-04 11:00:00
# 25: 56.263 2020-11-04 11:00:00
# 26: 56.265 2020-11-04 11:00:00
# 27: 56.271 2020-11-04 11:00:00
# 28: 56.271 2020-11-04 11:00:00
# 29: NaN 2020-11-04 11:30:00
# 30: NaN 2020-11-04 11:30:00
or if you prefer case_when
:
DT %>%
mutate(
media = with(
list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
case_when(between(media, med - 1.5*iq, med 1.5*iq) ~ media, TRUE ~ NaN))
)