Optimize time series aggregation in R


I have a big time series (DateTime, Value, Instance), and before visualizing it I need to aggregate the data using the max value per interval (15 minutes in my example) for each Instance.

I did not find a native aggregation function in R, so I created two custom functions, one based on ceiling_date and one based on cut. See my example:

library(tidyverse)
library(lubridate)


agg_fun_1 <- function(data, aggregation_period = 900) {

    agg_period <- paste(aggregation_period, "secs")

    agg_data <- data %>%
        group_by(across(-c(Value, datetime)),
                 datetime = as.POSIXct(cut(datetime, agg_period)) + aggregation_period) %>%
        summarise(Value = max(Value), .groups = "drop") %>%
        mutate(Value = ifelse(is.infinite(Value), NA, Value))

    return(agg_data)

}


agg_fun_2 <- function(data, aggregation_period = "15 mins") {

    agg_data <- data %>%
        group_by(across(-c(Value, datetime)), datetime = ceiling_date(datetime, aggregation_period))

    suppressWarnings(
        agg_data <- agg_data %>%
            summarise(Value = max(Value, na.rm = FALSE), .groups = "drop") %>%
            mutate(Value = ifelse(is.infinite(Value), NA, Value))
    )

    return(agg_data)

}


set.seed(42)

example_data <- tibble()

for(i in 1:256) {
    
    example_data <- rbind(example_data,
        
        data.frame( Instance = rep(i,20002),
                     datetime = seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), as.POSIXct("2020-12-26 10:00:00") + 15*20001, "15 sec"),
                     Value = sample(0:1000, 20002, replace=TRUE)
                     )   
    )
    
}

gc()

start_time <- Sys.time()

agg_fun_1(example_data)

end_time <- Sys.time()
end_time - start_time

gc()

start_time <- Sys.time()

agg_fun_2(example_data)

end_time <- Sys.time()
end_time - start_time
  • agg_fun_1 execution time is 2.3 minutes, RAM usage - around 702 MB on my laptop.
  • agg_fun_2 execution time is 1.9 minutes, RAM usage - around 930 MB on my laptop.

In a real environment, I will run 8 R scripts in parallel, and my data could be 3-5 times larger than in this example. In that case, I could run out of resources.

Are there any ways to optimize the RAM usage and execution time of my functions, or is there a better aggregation approach?

CodePudding user response:

Most of the time is spent computing the aggregated timestamp (cut or ceiling_date). Try timing that computation on its own, without the rest of your code, to see it.
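For instance, a quick check along these lines should show it (a sketch only; the vector x and its length are made up for illustration, and the absolute timings will vary by machine):

library(lubridate)

# a sample vector of timestamps, similar in shape to the example data
x <- seq.POSIXt(as.POSIXct("2020-12-26 10:00:00"), by = "15 sec", length.out = 1e6)

# time only the computation of the aggregated timestamp, nothing else
system.time(as.POSIXct(cut(x, "900 secs")))                 # approach of agg_fun_1
system.time(ceiling_date(x, "15 mins"))                     # approach of agg_fun_2
system.time(as_datetime(floor(as.integer(x) / 900) * 900))  # plain integer arithmetic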

Direct computation on datetime as a number of seconds since the origin is much faster (especially if your agg_period can easily be expressed as a number of seconds).

Here I use data.table instead of the tidyverse (but the same idea works with dplyr; see the sketch after the data.table code).

# data.frame on steroids
library(data.table)

# convert your data to a data.table (by reference)
example_data <- setDT(example_data)

agg_period <- 15*60 # seconds

agg_data <- example_data[,
  # compute Value as the max of each group
  .(Value = max(Value)),
  # group by Instance and the aggregated datetime
  # (as_datetime() comes from lubridate, which is already loaded above)
  by = .(Instance,
         datetime = as_datetime(floor(as.integer(datetime) / agg_period) * agg_period))
]

# replace infinite Values with NA (:= modifies agg_data in place)
agg_data[is.infinite(Value), Value := NA]
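
For completeness, here is a rough dplyr version of the same idea, keeping the grouping style from the question (a sketch only; agg_fun_3 is just a name I made up and I have not benchmarked it, but the cheap grouping key should make it noticeably faster than the cut / ceiling_date versions):

# same integer arithmetic for the grouping key, written with dplyr
# (as_datetime() comes from lubridate, loaded in the question)
agg_fun_3 <- function(data, aggregation_period = 900) {

    data %>%
        group_by(across(-c(Value, datetime)),
                 datetime = as_datetime(floor(as.integer(datetime) / aggregation_period) *
                                            aggregation_period)) %>%
        summarise(Value = max(Value), .groups = "drop") %>%
        mutate(Value = ifelse(is.infinite(Value), NA, Value))

}

agg_fun_3(example_data)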