Counting consecutive values and extract the length of each "event" (R)-CodePudding

The data frame that I'm working with is shown below:

yyyy   mm  dd hour
2011    5   1   10
2011    5   1   12
2011    5   1   13
2011    5   1   14
2011    5   1   15
2011    5   1   16
2011    5   1   17
2011    5   1   20
2011    5   2   11
2011    5   2   12
2011    5   2   13
2011    5   2   14
2011    5   2   15
2011    5   2   16
2011    5   2   17
2011    5   2   18
2011    5   3   10
[...]

I would like to add a new column that reports the number of consecutive hours, as shown below:

yyyy   mm  dd hour event
2011    5   1   10     1
2011    5   1   12     6
2011    5   1   13     6
2011    5   1   14     6
2011    5   1   15     6
2011    5   1   16     6
2011    5   1   17     6
2011    5   1   20     1
2011    5   2   11     8
2011    5   2   12     8
2011    5   2   13     8
2011    5   2   14     8
2011    5   2   15     8
2011    5   2   16     8
2011    5   2   17     8
2011    5   2   18     8
2011    5   3   10     1
[...]

The number of consecutive values ([mm], [dd]) of the hours parameter is reported by the value in the event column ([mm] [dd]).

Any suggestion?

CodePudding user response：

Both base R solutions below are cumsum/ave solutions. There are 2 solutions following Wimpel's comment to the question.

1.

If events cannot span multiple days.

i <- c(0, abs(diff(df1$hour)) != 1)
ave(cumsum(i), cumsum(i), FUN = length)
# [1] 1 6 6 6 6 6 6 1 8 8 8 8 8 8 8 8 1

2.

If events can span multiple days.

h <- with(df1, ISOdatetime(yyyy, mm, dd, hour, 0L, 0L))
j <- c(0, abs(diff(h)) != 1)
ave(cumsum(j), cumsum(j), FUN = length)
# [1] 1 6 6 6 6 6 6 1 8 8 8 8 8 8 8 8 1

And assign the result of ave to the new column.

df1$event <- ave(<as above>)

Data

df1 <-
structure(list(yyyy = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
2011L, 2011L), mm = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L), dd = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), hour = c(10L, 12L, 13L, 
14L, 15L, 16L, 17L, 20L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
10L)), class = "data.frame", row.names = c(NA, -17L))

CodePudding user response：

Here's a dplyr solution:

df %>% 
  group_by(yyyy,mm,dd) %>% 
  mutate(event = ifelse(hour == lead(hour) -1 | hour == lag(hour)   1,1,0)) %>% 
  mutate(event = ifelse(is.na(event),0,event)) %>% 
  group_by(yyyy,mm,dd,event) %>% 
  mutate(event = sum(event)) %>% 
  mutate(event = ifelse(event == 0, 1, event)) # If removed, the ungrouped values will have event = 0

Output:

# A tibble: 17 x 5
# Groups:   yyyy, mm, dd, event [4]
    yyyy    mm    dd  hour event
   <int> <int> <int> <int> <dbl>
 1  2011     5     1    10     1
 2  2011     5     1    12     6
 3  2011     5     1    13     6
 4  2011     5     1    14     6
 5  2011     5     1    15     6
 6  2011     5     1    16     6
 7  2011     5     1    17     6
 8  2011     5     1    20     1
 9  2011     5     2    11     8
10  2011     5     2    12     8
11  2011     5     2    13     8
12  2011     5     2    14     8
13  2011     5     2    15     8
14  2011     5     2    16     8
15  2011     5     2    17     8
16  2011     5     2    18     8
17  2011     5     3    10     1