Home > Enterprise >  Non-overlapping sliding window based on index
Non-overlapping sliding window based on index

Time:11-03

For a data.frame, df, which has an index column and a value column, I would like to calculate e.g.the mean of the values in non-overlapping sliding windows, with the window size based on the units in the index column (for example, windows which cover 10 units in the index).

There is runner::runner and slider::slide_index which allow you to slide in windows based on an index column, but I don't see a way to make the windows non-overlapping.

df = structure(list(V3 = c(17054720L, 17075353L, 17087656L, 17099107L, 
17152611L, 17154984L, 17178213L, 17256231L, 17264565L, 17280822L, 
17281931L, 17285949L, 17289118L, 17294251L, 17301217L, 17301843L, 
17304246L, 17304887L, 17306104L, 17310741L, 17312596L, 17315102L, 
17315503L, 17317233L, 17318150L, 17319156L, 17326181L, 17326432L, 
17394989L, 17395610L, 17396612L, 17397875L, 17398508L, 17398800L, 
17398812L, 17399211L, 17405173L, 17407349L, 17407566L, 17409897L, 
17410373L, 17412216L, 17412806L, 17414103L, 17414640L, 17415572L, 
17426401L, 17427037L, 17429384L, 17429434L, 17433210L, 17434084L, 
17436846L, 17441524L, 17442154L, 17443131L, 17445502L, 17446157L, 
17446914L, 17450515L, 17452966L, 17462185L, 17467411L, 17467684L, 
17470779L, 17475921L, 17488195L, 17489577L, 17489890L, 17490932L, 
17492203L, 17492452L, 17493792L, 17494101L, 17494547L, 17524203L, 
17525584L, 17525970L, 17529814L, 17541673L, 17545859L, 17557144L, 
17567699L, 17575800L, 17580394L, 17580813L, 17585441L, 17586471L, 
17587680L, 17587975L, 17589209L, 17589246L, 17593685L, 17594915L, 
17597462L, 17599844L, 17603801L, 17605824L, 17611515L, 17615213L
), V1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 1L)), row.names = c(NA, -100L), class = "data.frame")

CodePudding user response:

What about something like this:

df <- data.frame(
  index = 0:99, 
  value = 1:100)

df %>% 
  mutate(window = floor(index/10)) %>% 
  group_by(window) %>% 
  summarise(value = mean(value), 
            n = n())
# # A tibble: 10 × 3
#   window value     n
#    <dbl> <dbl> <int>
# 1      0   5.5    10
# 2      1  15.5    10
# 3      2  25.5    10
# 4      3  35.5    10
# 5      4  45.5    10
# 6      5  55.5    10
# 7      6  65.5    10
# 8      7  75.5    10
# 9      8  85.5    10
# 10      9  95.5    10

In the answer above, you divide the index by the window width and wrap that in the floor() function so it rounds all observations down to the nearest integer. This assumes that the index values are consecutive integers. An alternative, if the index is not sequential is to make it so, like what follows:

df <- data.frame(
  index = sample(0:1000, 100, replace=FALSE), 
  value = 1:100)

df %>% 
  arrange(index) %>% 
  mutate(obs = seq_along(index)-1, 
         window = floor(obs/10)) %>% 
  group_by(window) %>% 
  summarise(value = mean(value), 
            n = n())
# A tibble: 10 × 3
#   window value     n
#    <dbl> <dbl> <int>
# 1      0  38.2    10
# 2      1  50.1    10
# 3      2  63.6    10
# 4      3  64.9    10
# 5      4  44      10
# 6      5  41.5    10
# 7      6  65.4    10
# 8      7  45.1    10
# 9      8  48.9    10
# 10      9  43.3    10
  • Related