Home > OS >  adjust "width" argument in rollapply() function in r for discontinuous dates
adjust "width" argument in rollapply() function in r for discontinuous dates

Time:10-24

I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.

I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.

I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.

Here's an example:

test_data <- data.frame(date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"), 
                        value = c(0, 1, 1, 1, 0))

test_data$date <- ymd(test_data$date)

select_first_1_value <- test_data %>%
  mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
  filter(value == 1) %>%
  filter(row_number() == 1) %>%
  ungroup

With the argument as width = 3, it works. It finds that 2000-01-02 is the first date where a value = 1 occurs for at least 3 values. However, if I change this to 14, it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values that equal 1 (for a total of 15), it would be incorrect because the value = 0 at 2000-01-18 and it is only counting data points and not dates.

But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.

Any help is greatly appreciated. Thank you.

CodePudding user response:

There really are two problems here:

  1. How to roll by date rather than number of points.
  2. how to find the first stretch of 14 days of 1's assuming that missing dates are 1.

Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.

test_data2 <- data.frame(
  date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"), 
  value = c(0, 1, 1, 1, 0))

1) What we need to do is not roll by date but rather expand the series to fill in the missing dates giving zz and then use rollapply. Below do that by creating a zoo series (which also converts the dates to Date class) and then convert that to ts class. Because ts class can only represent regularly spaced series that conversion will fill in the missing dates and provide a value of NA for them. We can fill those in with 1 and then convert back to zoo with Date class index.

library(zoo)

z <- read.zoo(test_data2)
zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)
r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")
time(r)[which(r == 1)[1]]
## [1] "1999-12-02"

2) Another way to solve this not involving rollapply at all would be to use rle. Using zz from above

ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))
tt[which(ok)[1]]
## [1] "1999-12-02"

3) Another way without using rollapply is to extract the 0 value rows and then keep only those whose difference exceeds 14 days from the next 0 value row. Finally take the first such row and use the date one day after it.

library(dplyr)
test_data %>%
  mutate(date = as.Date(date)) %>%
  filter(value == 0) %>%
  mutate(diff = as.numeric(lead(date) - date)) %>%
  filter(diff > 14) %>%
  head(1) %>%
  mutate(date = date   1)
##         date value diff
## 1 2000-01-02     0   17

rollapply over dates rather than points

4) The question also discussed using rollapply over dates rather than points which we address here. As noted above this does not actually solve the question of finding the first stretch of 14 ones so instead we show how to find the first date in the series which starts a stretch of at least 14 ones. In general, we do this by first calculating a width vector and then use rollapply in the usual way but with those widths rather than using a scalar width. Returning back to test_data from the question:

# using test_data from question
tt <- as.Date(test_data$date)

w <- findInterval(tt   13, tt, rightmost.closed = TRUE) - seq_along(tt)   1
r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
tt[which(r == 1)[1]]
## [1] "2000-01-02"

There are further examples in ?rollapply showing how to roll by time rather than number of points.

sqldf

5) A completely different way of approaching the problem of finding the first 14 ones with a date in the series is to use an SQL self join. It joins the first instance of test aliased to a to a second instance b associating all b's within the indicated date range and taking the minimum of those creating a new column min14 with those minimums. The having clause then keeps only those rows for which min14 is 1 and of those the limit clause keeps the first. We then retain the date at the end.

library(sqldf)

test <- transform(test_data, date = as.Date(date))

sqldf("select a.*, min(b.value) min14
  from test a
  left join test b on b.date between a.date and a.date   13
  group by a.rowid
  having min14 = 1
  limit 1")$date
## [1] "2000-01-02"

CodePudding user response:

You may look into runner package where you can pass k as days/weeks etc. See this example, to sum the last 3 days of value.

library(dplyr)
library(runner)

test_data %>%
  mutate(date = as.Date(date), 
         sum_val = runner(value, k = "3 days", idx = date, f = sum))

#        date value sum_val
#1 2000-01-01     0       0
#2 2000-01-02     1       1
#3 2000-01-03     1       2
#4 2000-01-17     1       1
#5 2000-01-18     0       1

Notice row 4 has value 1 (and not 3) because there is only 1 value that occurred in last 3 days.

  • Related