R - mutate new column that tells me if any row in the next x number of rows is below a given value

I need to create a new column that tells me if a process that started on a given date encountered a temperature below zero. I'd like to iterate over every row of the dataframe, checking a group of rows that begins with the current row and includes the next x number of rows, where x is the value in the 'duration' column. If any of the temp values in that group of rows is below zero, then the value in the new column would be TRUE.

set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
                    temp = runif(100,min=-2,max=30),
                    duration = sample.int(20,100,replace=TRUE))

> head(df, n=10)
         date        temp duration
1  2020-01-01  6.49627722       12
2  2020-01-02  9.90796479       16
3  2020-01-03 16.33130763        1
4  2020-01-04 27.06264928       13
5  2020-01-05  4.45382179        6
6  2020-01-06 26.74846992       17
7  2020-01-07 28.22960860        9
8  2020-01-08 19.14552936        7
9  2020-01-09 18.13164940       19
10 2020-01-10 -0.02283935       18

For each row I'm trying to answer the question "Did the process that started on this date and lasted [duration] days encounter a temperature below zero?" Since the value of 'duration' in the first row is 12, and there is a negative number in the tenth row of the 'temp' column, then the value of the new column would be TRUE. The second row would also be TRUE, the third row would be FALSE, and the following six rows would be TRUE.

I've been playing with the map_lgl function of purrr, but I'm not getting anywhere.

CodePudding user response：

You can use zoo::rollapply, which is vectorised in both data and width.

library(dplyr)
library(zoo)
df %>%
    mutate(answer = rollapply(
        temp, 
        width = duration, 
        FUN = function(x) any(x < 0), 
        align = "left", 
        fill = NA))
#          date        temp duration answer
#1   2020-01-01  6.49627722       12   TRUE
#2   2020-01-02  9.90796479       16   TRUE
#3   2020-01-03 16.33130763        1  FALSE
#4   2020-01-04 27.06264928       13   TRUE
#5   2020-01-05  4.45382179        6   TRUE
#6   2020-01-06 26.74846992       17   TRUE
#7   2020-01-07 28.22960860        9   TRUE

CodePudding user response：

You could do it with a for loop, but @Maurits Evers' method is much better:

set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
                temp = runif(100,min=-2,max=30),
                duration = sample.int(20,100,replace=TRUE))

for (i in 1:nrow(df)) {
  df$encounter[i] <- ifelse(min(df[i:(i + df$duration[i]),]$temp) < 0, "yes", "no")
}
head(df, 10)
#>          date        temp duration encounter
#> 1  2020-01-01  6.49627722       12       yes
#> 2  2020-01-02  9.90796479       16       yes
#> 3  2020-01-03 16.33130763        1        no
#> 4  2020-01-04 27.06264928       13       yes
#> 5  2020-01-05  4.45382179        6       yes
#> 6  2020-01-06 26.74846992       17       yes
#> 7  2020-01-07 28.22960860        9       yes
#> 8  2020-01-08 19.14552936        7       yes
#> 9  2020-01-09 18.13164940       19       yes
#> 10 2020-01-10 -0.02283935       18       yes

Edit

Using 'purrr' syntax:

library(dplyr)
library(purrr)

set.seed(1)

df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
                temp = runif(100,min=-2,max=30),
                duration = sample.int(20,100,replace=TRUE))
df %>%
  mutate(answer = map(seq_along(temp), ~ifelse(min(df[.x:(.x + df$duration[.x]),]$temp, na.rm = TRUE) < 0, TRUE, FALSE))) %>%
  head(10)
#>          date        temp duration answer
#> 1  2020-01-01  6.49627722       12   TRUE
#> 2  2020-01-02  9.90796479       16   TRUE
#> 3  2020-01-03 16.33130763        1  FALSE
#> 4  2020-01-04 27.06264928       13   TRUE
#> 5  2020-01-05  4.45382179        6   TRUE
#> 6  2020-01-06 26.74846992       17   TRUE
#> 7  2020-01-07 28.22960860        9   TRUE
#> 8  2020-01-08 19.14552936        7   TRUE
#> 9  2020-01-09 18.13164940       19   TRUE
#> 10 2020-01-10 -0.02283935       18   TRUE

Created on 2022-06-28 by the reprex package (v2.0.1)

CodePudding user response：

Using map2_lgl from purrr:

library(purrr)
df$inside <- map2_lgl(df$duration, seq_len(nrow(df)), ~ any(df$temp[.y:(.x + .y)] < 0))
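
One thing to watch with the one-liner above: for rows near the end of the data frame, the window runs past the last row, so df$temp[...] picks up NA values and any() can return NA instead of FALSE. Here is a minimal base-R sketch that clamps the window to the last row. The helper name encountered_below and its threshold argument are made up for illustration; it assumes every duration is at least 0 and that temp has no missing values.

```r
# Hypothetical helper (not part of the answer above): checks rows i..(i + duration[i]),
# clamped to the last row so the result is never NA for complete data.
encountered_below <- function(temp, duration, threshold = 0) {
  n <- length(temp)
  vapply(seq_len(n), function(i) {
    last <- min(i + duration[i], n)   # do not index past the end of the data
    any(temp[i:last] < threshold)
  }, logical(1))
}

set.seed(1)
df <- data.frame(date = seq(as.Date("2020-01-01"), as.Date("2020-04-09"), by = 1),
                 temp = runif(100, min = -2, max = 30),
                 duration = sample.int(20, 100, replace = TRUE))

df$inside <- encountered_below(df$temp, df$duration)
head(df, 10)
```

Whether a window that is cut short by the end of the data should count as FALSE or as NA (unknown) depends on what you need; the sketch treats it as FALSE when no sub-zero value is seen.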

Edit:

Btw, I tried really hard to make it work with dplyr. Don't yet have the secret sauce but I'll update if I get it.

2nd Edit:

dplyr and base only. I don't know whether you can do it without grouping the df:

df |>
  group_by(cumsum(row_number() %in% which(temp < 0))) |>
  mutate(inside = ifelse(row_number() == 1 & !cur_group_id() <= 1, TRUE,
                         (duration + row_number()) > length(cur_group_rows())))

CodePudding user response：

In case you have some need to do this quickly, see below.

Note on question interpretation:

It's not clear to me based on the description if you want to include duration rows or duration - 1 rows, e.g. if duration is 2 on row 1, should I check rows 1 and 2 or rows 1, 2, and 3? In this answer I assume you want to check rows 1, 2, and 3. If you want to only check rows 1 and 2 in that example, change duration[i] below to duration[i] - 1.
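
To make the two readings concrete, here is a small sketch on made-up toy data (four rows, with the only sub-zero value in row 3). The names inclusive and exclusive are just labels for this illustration, and the sketch assumes every duration is at least 1.

```r
# Toy data, invented for illustration: only row 3 is below zero.
temp     <- c(5, 4, -1, 6)
duration <- c(2, 1, 1, 1)
n <- length(temp)

# Reading 1: the current row plus the next `duration` rows (duration + 1 rows in total).
inclusive <- vapply(seq_len(n), function(i) {
  any(temp[i:min(i + duration[i], n)] < 0)
}, logical(1))

# Reading 2: exactly `duration` rows, starting at the current row.
exclusive <- vapply(seq_len(n), function(i) {
  any(temp[i:min(i + duration[i] - 1, n)] < 0)
}, logical(1))

data.frame(temp, duration, inclusive, exclusive)
```

Rows 1 and 2 are where the two readings disagree: row 1 (duration 2) only reaches row 3 under the first reading, and row 2 (duration 1) only looks at itself under the second.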

Function explanation:

Get the indices of all rows with low temps (negtemp = which(temp < 0)). Start with the first index (negtemp[j] with j = 1). For the first row of df, check whether the end of its interval is at or after that index. If it is, return TRUE for that row, because it has a low temp in the interval. Now do the same for the next row, and so on. If at any point the row you're on is after negtemp[j], move to the next index by setting j <- j + 1, unless you're already at the last element of negtemp; in that case leave j unchanged, and out[i] is FALSE for that row and all remaining rows.

This way, you're looping through all the rows only once and for each row only doing a few additions and comparisons (vs creating a new temp vector at each row).

set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
                    temp = runif(100,min=-2,max=30),
                    duration = sample.int(20,100,replace=TRUE))

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2

check_lowtemp = function(temp, duration){
  negtemp = which(temp < 0)
  j = 1

  out = logical(length(temp))
  for (i in seq_along(temp)) {
    if (i > negtemp[j]) j = min(length(negtemp), j + 1)
    out[i] = i <= negtemp[j] && (i + duration[i]) >= negtemp[j]
  }
  out
}

df %>% 
  mutate(lowtemp = check_lowtemp(temp, duration)) %>% 
  head
#>         date      temp duration lowtemp
#> 1 2020-01-01  6.496277       12    TRUE
#> 2 2020-01-02  9.907965       16    TRUE
#> 3 2020-01-03 16.331308        1   FALSE
#> 4 2020-01-04 27.062649       13    TRUE
#> 5 2020-01-05  4.453822        6    TRUE
#> 6 2020-01-06 26.748470       17    TRUE
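
The same idea (find the nearest sub-zero row at or after each row, then compare it with the end of the interval) can also be sketched without an explicit loop. This is an alternative formulation, not the function above: check_lowtemp_vec is a hypothetical name, and the sketch assumes temp contains no missing values. One small difference is that it returns all FALSE when no temperature is below zero, whereas the loop version as written would hit a missing value in its if () condition in that case, because negtemp[1] is NA.

```r
# Hypothetical vectorised variant: same interval as check_lowtemp(),
# i.e. rows i..(i + duration[i]).
check_lowtemp_vec <- function(temp, duration) {
  n <- length(temp)
  idx <- ifelse(temp < 0, seq_len(n), Inf)   # a row's own index if below zero, else Inf
  next_neg <- rev(cummin(rev(idx)))          # nearest sub-zero index at or after each row
  next_neg <= seq_len(n) + duration          # does it fall inside the interval?
}

# Small made-up example: the only sub-zero value is in row 3.
check_lowtemp_vec(temp = c(5, 3, -1, 4, 6), duration = c(1, 2, 1, 3, 1))
```

The reversed cummin() is what carries each sub-zero index backwards to all the rows before it, up to the previous sub-zero row.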

Benchmark:

library(bench)
library(zoo)
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric

df = df %>% slice(sample(row_number(), 5e5, TRUE))

mark(
  a = {
    df %>%
      mutate(answer = rollapply(
          temp < 0, 
          width = duration, 
          FUN = any,
          align = "left", 
          fill = NA))
  }
  , b = {
      df %>% 
        mutate(lowtemp = check_lowtemp(temp, duration))
  }
  , check = FALSE
) 
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 a              6.9s     6.9s     0.145  238.01MB     7.53
#> 2 b            62.9ms   69.7ms    13.4      5.82MB     1.92

Created on 2022-06-27 by the reprex package (v2.0.1)