I have a data.table in which I want to detect the presence of outliers (according to skewness and kurtosis) and, if found, correct them.
To this purpose, when an outlier is detected and var
is the highest value, I want to set the highest value in var
to be equal to the second highest. Below is a minimal (almost) working example of my code:
`%>%` <- fastpipe::`%>>%`
country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)
melted_data <- data.table(country, year, var)
melted_data %>%
.[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
.[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
.[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%
# Ranks
.[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
.[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year] %>%
# Identify and correct outliers
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
.[rank_low_first==1, lowest1 := var, by=year] %>%
.[rank_low_first==2, lowest2 := var, by=year] %>%
.[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]
What I am trying to achieve is all in the last row. However, this does not work because the values highest1
and highest2
do not span the whole year
group (edit: see also screenshot below). I think the solution would be to modify the following lines
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
so that highest1
and highest2
are copied to all rows in that year. How can I achieve that?
EDIT: Solved the first part of my problem by rewriting the last lines this way:
.[, highest1 := head(sort(var, decreasing=TRUE), 1), by=year] %>%
.[, highest2 := head(unique(sort(var, decreasing=TRUE)), 2)[2], by=year] %>% # unique() is to account for ties
.[, lowest1 := head(sort(var, decreasing=FALSE), 1), by=year] %>%
.[, lowest2 := head(unique(sort(var, decreasing=FALSE)), 2)[2], by=year] %>%
.[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year] %>% # This row (and following one) do not give desired result
.[outliers==TRUE & skew<0 & var==lowest1, var<-lowest2, by=year]
First four rows give me what I want, but the last two do not.
CodePudding user response:
In the identify and correct outliers section, I believe you can make the following change:
f <- function(v,r,i) v[r==i & !is.na(r)]
melted_data[, `:=`(
highest1 = f(var,rank_high_first,1),highest2=f(var,rank_high_first,2),
lowest1 = f(var,rank_low_first,1),lowest2=f(var,rank_low_first,2)
),by=year]
Also, I was wondering about your initial definition of outliers
. Should the parentheses be added as below?:
.[, outliers := kurt>1 | (kurt>3.5 & abs(skew)>2), by=year]
The problem you are seeing in the last two lines, I believe can be solved by:
.[outliers==TRUE & skew>0 & var==highest1, var:=highest2] %>%
.[outliers==TRUE & skew<0 & var==lowest1, var:=lowest2]
Note: You do not need by=year
here, and you should use :=
instead of <-