Home > Back-end >  Count 'breaks' in a sequence of columns
Count 'breaks' in a sequence of columns

Time:09-06

I have a time series associated with items. Sometimes there are missing periods in the middle. I would like to identify those observations that have a break in the time series. Take these data:

ref_no <- 1:6
X2015 <- c(2015, 2015, 2015, 2015,   NA,   NA)
X2016 <- c(2016, 2016, 2016,   NA, 2016, 2016)
X2017 <- c(2017, 2017,   NA,   NA,   NA, 2017)
X2018 <- c(2018, 2018, 2018, 2018,   NA, 2018)
X2019 <- c(2019, 2019, 2018,   NA, 2019,   NA)
X2020 <- c(2020,   NA,   NA, 2020, 2020,   NA)
df <- data.frame(ref_no, X2015, X2016, X2017, X2018, X2019, X2020)

There are breaks for ref_no items #3, #4 and #5, with #4 having two breaks (missing 2016, 2017 and then 2019).

  ref_no X2015 X2016 X2017 X2018 X2019 X2020
1      1  2015  2016  2017  2018  2019  2020
2      2  2015  2016  2017  2018  2019    NA
3      3  2015  2016    NA  2018  2018    NA
4      4  2015    NA    NA  2018    NA  2020
5      5    NA  2016    NA    NA  2019  2020
6      6    NA  2016  2017  2018    NA    NA

Can we have a variable that counts these breaks, e.g.

c <- c(0,0,1,2,1,0)

CodePudding user response:

We may use diff to get the difference between adjacent element by each row.

library(dplyr)
library(tidyr)
df %>% 
  pivot_longer(cols = -ref_no, values_drop_na = TRUE) %>% 
  group_by(ref_no) %>%
  summarise(break_count = sum(diff(value) > 1))

-output

# A tibble: 6 × 2
  ref_no break_count
   <int>       <int>
1      1           0
2      2           0
3      3           1
4      4           2
5      5           1
6      6           0

Or using base R

apply(df[-1], 1, \(x) sum(diff(x[complete.cases(x)]) > 1))
[1] 0 0 1 2 1 0

CodePudding user response:

In base, we can look at the run length encoding of whether or not each row is NA, drop the first and last values, and sum what remains.

apply(is.na(df[-1]), 1, \(x) {rle(x)$values |> head(-1) |> tail(-1) |> sum()})
# [1] 0 0 1 2 1 0
  • Related