Count 'breaks' in a sequence of columns-CodePudding

I have a time series associated with items. Sometimes there are missing periods in the middle. I would like to identify those observations that have a break in the time series. Take these data:

ref_no <- 1:6
X2015 <- c(2015, 2015, 2015, 2015,   NA,   NA)
X2016 <- c(2016, 2016, 2016,   NA, 2016, 2016)
X2017 <- c(2017, 2017,   NA,   NA,   NA, 2017)
X2018 <- c(2018, 2018, 2018, 2018,   NA, 2018)
X2019 <- c(2019, 2019, 2018,   NA, 2019,   NA)
X2020 <- c(2020,   NA,   NA, 2020, 2020,   NA)
df <- data.frame(ref_no, X2015, X2016, X2017, X2018, X2019, X2020)

There are breaks for ref_no items #3, #4 and #5, with #4 having two breaks (missing 2016, 2017 and then 2019).

  ref_no X2015 X2016 X2017 X2018 X2019 X2020
1      1  2015  2016  2017  2018  2019  2020
2      2  2015  2016  2017  2018  2019    NA
3      3  2015  2016    NA  2018  2018    NA
4      4  2015    NA    NA  2018    NA  2020
5      5    NA  2016    NA    NA  2019  2020
6      6    NA  2016  2017  2018    NA    NA

Can we have a variable that counts these breaks, e.g.

c <- c(0,0,1,2,1,0)

CodePudding user response：

We may use diff to get the difference between adjacent element by each row.

library(dplyr)
library(tidyr)
df %>% 
  pivot_longer(cols = -ref_no, values_drop_na = TRUE) %>% 
  group_by(ref_no) %>%
  summarise(break_count = sum(diff(value) > 1))

-output

# A tibble: 6 × 2
  ref_no break_count
   <int>       <int>
1      1           0
2      2           0
3      3           1
4      4           2
5      5           1
6      6           0

Or using base R

apply(df[-1], 1, \(x) sum(diff(x[complete.cases(x)]) > 1))
[1] 0 0 1 2 1 0

CodePudding user response：

In base, we can look at the run length encoding of whether or not each row is NA, drop the first and last values, and sum what remains.

apply(is.na(df[-1]), 1, \(x) {rle(x)$values |> head(-1) |> tail(-1) |> sum()})
# [1] 0 0 1 2 1 0