Home > front end >  Average over time series data with custom time slice window
Average over time series data with custom time slice window

Time:05-18

I am working with some tennis ranking data in R, that gives the evolution of tennis rankings over time of all the players on the ATP tour.

An example of the data I am using can be found here, giving the rankings data from 2000's: enter image description here

CodePudding user response:

Try out roll functions from the zoo package

library(zoo)
set.seed(2137)
my_data <- rnorm(100)
zoo::rollmean(my_data, 10)

Depending on data types and time frequency of your data it could look like this

rankings %>%  
  group_by(player) %>% 
  summarise(
    meanRanking = zoo::rollmean(rank, k = 2, na.rm = TRUE),
) 

CodePudding user response:

Your data are reasonably large, and data.table can help quite a bit with speed. Here is an approach that is very fast, and it uses a flexible function f(s,e,p,u), which allows you to pass in any start (s) or end (e) date, an integer period (e.g. 2 for 2 years), and time unit (u) which takes values "y", "m", or "d", for years, months, days, respectively

f <- function(s,e,p, u=c("y","m","d")) {
  u=match.arg(u)
  uf = list("y"=years,"m"=months,"d"=days)
  data.table(s = seq(as.Date(s), as.Date(e),by=paste(p,u)))[,`:=`(e=s %m % uf[[u]](p), period=1:.N)]
}

Then you just apply the function to dt in a non-equi join

dt[f("2000-01-01", "2008-01-01",2), on=.(ranking_date>=s, ranking_date<=e)] %>% 
  .[,.(ranking=mean(rank,na.rm=T)), by=.(player,period )]

Output:

       player period     ranking
    1: 101736      1    2.769231
    2: 102338      1    5.211538
    3: 101948      1    4.730769
    4: 103017      1   23.259615
    5: 102856      1    2.538462
   ---                          
13543: 105996      5 1780.500000
13544: 105050      5 1665.333333
13545: 105757      5 1781.000000
13546: 121555      5 1328.500000
13547: 106018      5 1508.000000

For 3 month windows, just call f with p=3, and u="m"

dt[f("2000-01-01", "2008-01-01",3, "m"), on=.(ranking_date>=s, ranking_date<=e)] %>% 
  .[,.(ranking=mean(rank,na.rm=T)), by=.(player,period )]

Output (for 3 month windows)

       player period     ranking
    1: 101736      1    1.000000
    2: 102338      1    2.666667
    3: 101948      1    2.333333
    4: 103017      1    4.000000
    5: 102856      1    5.500000
   ---                          
62491: 105517     33 1502.000000
62492: 104405     33 1588.000000
62493: 104576     33 1588.000000
62494: 105500     33 1679.000000
62495: 108698     33 1844.000000
  • Related