I have a time series associated with items. Sometimes there are missing periods in the middle. I would like to identify those observations that have a break in the time series. Take these data:
ref_no <- 1:6
X2015 <- c(2015, 2015, 2015, 2015, NA, NA)
X2016 <- c(2016, 2016, 2016, NA, 2016, 2016)
X2017 <- c(2017, 2017, NA, NA, NA, 2017)
X2018 <- c(2018, 2018, 2018, 2018, NA, 2018)
X2019 <- c(2019, 2019, 2018, NA, 2019, NA)
X2020 <- c(2020, NA, NA, 2020, 2020, NA)
df <- data.frame(ref_no, X2015, X2016, X2017, X2018, X2019, X2020)
There are breaks for ref_no items #3, #4 and #5, with #4 having two breaks (missing 2016, 2017 and then 2019).
ref_no X2015 X2016 X2017 X2018 X2019 X2020
1 1 2015 2016 2017 2018 2019 2020
2 2 2015 2016 2017 2018 2019 NA
3 3 2015 2016 NA 2018 2018 NA
4 4 2015 NA NA 2018 NA 2020
5 5 NA 2016 NA NA 2019 2020
6 6 NA 2016 2017 2018 NA NA
Can we have a variable that counts these breaks, e.g.
c <- c(0,0,1,2,1,0)
CodePudding user response:
We may use diff
to get the difference between adjacent element by each row.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ref_no, values_drop_na = TRUE) %>%
group_by(ref_no) %>%
summarise(break_count = sum(diff(value) > 1))
-output
# A tibble: 6 × 2
ref_no break_count
<int> <int>
1 1 0
2 2 0
3 3 1
4 4 2
5 5 1
6 6 0
Or using base R
apply(df[-1], 1, \(x) sum(diff(x[complete.cases(x)]) > 1))
[1] 0 0 1 2 1 0
CodePudding user response:
In base, we can look at the run length encoding of whether or not each row is NA
, drop the first and last values, and sum what remains.
apply(is.na(df[-1]), 1, \(x) {rle(x)$values |> head(-1) |> tail(-1) |> sum()})
# [1] 0 0 1 2 1 0