Home > Software engineering >  Convert outliers and NA in a dataframe to NaN
Convert outliers and NA in a dataframe to NaN

Time:09-15

I have the following dataset:

structure(list(media = c(56.257, NA, NA, 
56.256, 56.267, NA, NA, 56.265, 56.262, 56.259, 56.265, 
56.263, 56.265, 56.262, 56.264, NaN, 56.268, 56.265, 560.26, 
5600.2, 56.265, 56.262, 56.263, 56.264, 56.263, 56.265, 56.271, 
56.271, 56.28, 56.281), cycle_rounded = structure(c(1604487600, 
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 
1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 1604487600, 
1604487600, 1604487600, 1604487600, 1604489400, 1604489400), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")), row.names = c(NA, -30L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000001d9d6a21f90>)

I would like to 1) Get values in media variable, above and below the IQR (interquartile range - outliers) of all values in media and convert them to NaN. 2) Convert NA to NaN, because when I will plot geom_line() it automatically gives me a break in the line, that is what I need.

I try something like it: mutate(media = case_when(media > 1.5*IQR(media, na.rm = T) ~ NaN)) But won't works.

Thanks for appreciate all help.

CodePudding user response:

IRQ is a range, so a difference of values. I think you need to include the median as well, from which the IQR is relevant.

For this code, I generally would use the IQR for both above and below the median of the values. If you really need to remove just above it, then you can remove the between(..) logic and simplify it.

data.table

DT[
  with(list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
  !between(media, med - 1.5*iq, med   1.5*iq)), media := NaN]
DT
#      media       cycle_rounded
#      <num>              <POSc>
#  1:    NaN 2020-11-04 11:00:00
#  2:     NA 2020-11-04 11:00:00
#  3:     NA 2020-11-04 11:00:00
#  4:    NaN 2020-11-04 11:00:00
#  5: 56.267 2020-11-04 11:00:00
#  6:     NA 2020-11-04 11:00:00
#  7:     NA 2020-11-04 11:00:00
#  8: 56.265 2020-11-04 11:00:00
#  9: 56.262 2020-11-04 11:00:00
# 10: 56.259 2020-11-04 11:00:00
# ---                           
# 21: 56.265 2020-11-04 11:00:00
# 22: 56.262 2020-11-04 11:00:00
# 23: 56.263 2020-11-04 11:00:00
# 24: 56.264 2020-11-04 11:00:00
# 25: 56.263 2020-11-04 11:00:00
# 26: 56.265 2020-11-04 11:00:00
# 27: 56.271 2020-11-04 11:00:00
# 28: 56.271 2020-11-04 11:00:00
# 29:    NaN 2020-11-04 11:30:00
# 30:    NaN 2020-11-04 11:30:00

dplyr

DT %>%
  mutate(
    media = if_else(
      between(media,
        median(media, na.rm = TRUE) - 1.5*IQR(media, na.rm = TRUE),
        median(media, na.rm = TRUE)   1.5*IQR(media, na.rm = TRUE)),
      media, NaN)
  )

Calculating the median and IQR multiple times seems a bit unnecessary, so we can preassign outside the call. However, if you're ever dealing with grouping, then that won't work, so here's an alternative:

DT %>%
  mutate(
    media = with(
      list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
      if_else(between(media, med - 1.5*iq, med   1.5*iq), media, NaN))
  )
#      media       cycle_rounded
#      <num>              <POSc>
#  1:    NaN 2020-11-04 11:00:00
#  2:     NA 2020-11-04 11:00:00
#  3:     NA 2020-11-04 11:00:00
#  4:    NaN 2020-11-04 11:00:00
#  5: 56.267 2020-11-04 11:00:00
#  6:     NA 2020-11-04 11:00:00
#  7:     NA 2020-11-04 11:00:00
#  8: 56.265 2020-11-04 11:00:00
#  9: 56.262 2020-11-04 11:00:00
# 10: 56.259 2020-11-04 11:00:00
# ---                           
# 21: 56.265 2020-11-04 11:00:00
# 22: 56.262 2020-11-04 11:00:00
# 23: 56.263 2020-11-04 11:00:00
# 24: 56.264 2020-11-04 11:00:00
# 25: 56.263 2020-11-04 11:00:00
# 26: 56.265 2020-11-04 11:00:00
# 27: 56.271 2020-11-04 11:00:00
# 28: 56.271 2020-11-04 11:00:00
# 29:    NaN 2020-11-04 11:30:00
# 30:    NaN 2020-11-04 11:30:00

or if you prefer case_when:

DT %>%
  mutate(
    media = with(
      list(med = median(media, na.rm = TRUE), iq = IQR(media, na.rm = TRUE)),
      case_when(between(media, med - 1.5*iq, med   1.5*iq) ~ media, TRUE ~ NaN))
  )
  • Related