I need to create a new column that tells me if a process that started on a given date encountered a temperature below zero. I'd like to iterate over every row of the dataframe, checking a group of rows that begins with the current row and includes the next x number of rows, where x is the value in the 'duration' column. If any of the temp values in that group of rows is below zero, then the value in the new column would be TRUE.
set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
temp = runif(100,min=-2,max=30),
duration = sample.int(20,100,replace=TRUE))
> head(df, n=10)
date temp duration
1 2020-01-01 6.49627722 12
2 2020-01-02 9.90796479 16
3 2020-01-03 16.33130763 1
4 2020-01-04 27.06264928 13
5 2020-01-05 4.45382179 6
6 2020-01-06 26.74846992 17
7 2020-01-07 28.22960860 9
8 2020-01-08 19.14552936 7
9 2020-01-09 18.13164940 19
10 2020-01-10 -0.02283935 18
For each row I'm trying to answer the question "Did the process that started on this date and lasted [duration] days encounter a temperature below zero?" Since the value of 'duration' in the first row is 12, and there is a negative number in the tenth row of the 'temp' column, then the value of the new column would be TRUE. The second row would also be TRUE, the third row would be FALSE, and the following six rows would be TRUE.
I've been playing with the map_lgl function of purrr, but I'm not getting anywhere.
CodePudding user response:
You can use zoo::rollapply, which is vectorised in both data and width.
library(dplyr)  # provides %>% and mutate()
library(zoo)
df %>%
  mutate(answer = rollapply(
    temp,
    width = duration,
    FUN = function(x) any(x < 0),
    align = "left",
    fill = NA))
# date temp duration answer
#1 2020-01-01 6.49627722 12 TRUE
#2 2020-01-02 9.90796479 16 TRUE
#3 2020-01-03 16.33130763 1 FALSE
#4 2020-01-04 27.06264928 13 TRUE
#5 2020-01-05 4.45382179 6 TRUE
#6 2020-01-06 26.74846992 17 TRUE
#7 2020-01-07 28.22960860 9 TRUE
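One thing worth checking is how many rows each window covers. As far as I understand, a numeric width with align = "left" counts the current row, so width = duration looks at the current row plus duration - 1 following rows. The base R sketch below imitates that behaviour on toy data so the two readings can be compared; left_window_any_neg is a hypothetical helper written for illustration, not a zoo function, and it assumes every width is at least 1.

```r
# Hypothetical helper (not part of zoo): left-aligned window with a per-row width.
# Row i looks at rows i .. i + width[i] - 1, so the width counts the current row.
left_window_any_neg <- function(temp, width) {
  n <- length(temp)
  vapply(seq_len(n), function(i) {
    last <- i + width[i] - 1
    if (last > n) return(NA)  # window runs past the data, similar to fill = NA
    any(temp[i:last] < 0)
  }, logical(1))
}

temp     <- c(5, 3, -1, 4, 2)
duration <- c(2, 1, 1, 1, 1)

# Current row plus duration - 1 more rows
left_window_any_neg(temp, duration)
#> [1] FALSE FALSE  TRUE FALSE FALSE

# Current row plus the next duration rows
left_window_any_neg(temp, duration + 1)
#> [1]  TRUE  TRUE  TRUE FALSE    NA
```

If the question means "the current row and the next duration rows", passing width = duration + 1 to rollapply should match that reading.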
CodePudding user response:
You could do it with a for loop, but @Maurits Evers' method is much better:
set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
temp = runif(100,min=-2,max=30),
duration = sample.int(20,100,replace=TRUE))
for (i in 1:nrow(df)) {
  # window: the current row plus the next `duration` rows
  df$encounter[i] <- ifelse(min(df[i:(i + df$duration[i]), ]$temp) < 0, "yes", "no")
}
head(df, 10)
#> date temp duration encounter
#> 1 2020-01-01 6.49627722 12 yes
#> 2 2020-01-02 9.90796479 16 yes
#> 3 2020-01-03 16.33130763 1 no
#> 4 2020-01-04 27.06264928 13 yes
#> 5 2020-01-05 4.45382179 6 yes
#> 6 2020-01-06 26.74846992 17 yes
#> 7 2020-01-07 28.22960860 9 yes
#> 8 2020-01-08 19.14552936 7 yes
#> 9 2020-01-09 18.13164940 19 yes
#> 10 2020-01-10 -0.02283935 18 yes
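Near the end of the data the window i:(i + duration) can run past the last row, and the extra rows come back as NA. A minimal sketch of one way around that, assuming it is acceptable to cut the window off at the last row and to store a logical instead of "yes"/"no" (both are my assumptions, not part of the question):

```r
set.seed(1)
df <- data.frame(date = seq(as.Date("2020-01-01"), as.Date("2020-04-09"), by = 1),
                 temp = runif(100, min = -2, max = 30),
                 duration = sample.int(20, 100, replace = TRUE))

n <- nrow(df)
df$encounter <- vapply(seq_len(n), function(i) {
  last <- min(i + df$duration[i], n)  # cap the window at the last row
  any(df$temp[i:last] < 0)            # TRUE if any temp in the window is below zero
}, logical(1))

head(df, 10)
```

vapply also fills the whole column in one go, so the column does not have to be grown element by element inside the loop.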
Edit
Using 'purrr' syntax:
library(dplyr)
library(purrr)
set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
temp = runif(100,min=-2,max=30),
duration = sample.int(20,100,replace=TRUE))
df %>%
  mutate(answer = map_lgl(seq_along(temp), ~ min(df[.x:(.x + df$duration[.x]), ]$temp, na.rm = TRUE) < 0)) %>%
head(10)
#> date temp duration answer
#> 1 2020-01-01 6.49627722 12 TRUE
#> 2 2020-01-02 9.90796479 16 TRUE
#> 3 2020-01-03 16.33130763 1 FALSE
#> 4 2020-01-04 27.06264928 13 TRUE
#> 5 2020-01-05 4.45382179 6 TRUE
#> 6 2020-01-06 26.74846992 17 TRUE
#> 7 2020-01-07 28.22960860 9 TRUE
#> 8 2020-01-08 19.14552936 7 TRUE
#> 9 2020-01-09 18.13164940 19 TRUE
#> 10 2020-01-10 -0.02283935 18 TRUE
Created on 2022-06-28 by the reprex package (v2.0.1)
CodePudding user response:
Using map2_lgl:
library(purrr)
df$inside <- map2_lgl(df$duration, seq_len(nrow(df)), ~ any(df$temp[.y:(.x + .y)] < 0))
Edit:
Btw, I tried really hard to make it work with dplyr. I don't have the secret sauce yet, but I'll update if I get it.
2nd Edit:
dplyr and base only. I don't know if you can do it without grouping the df.
library(dplyr)
df |>
  group_by(cumsum(row_number() %in% which(temp < 0))) |>
  mutate(inside = ifelse(row_number() == 1 & !cur_group_id() <= 1, TRUE,
                         (duration + row_number()) > length(cur_group_rows())))
CodePudding user response:
In case you have some need to do this quickly, see below.
Note on question interpretation:
It's not clear to me from the description whether, after the current row, you want to include the next duration rows or the next duration - 1 rows. For example, if duration is 2 on row 1, should I check rows 1 and 2 or rows 1, 2, and 3? In this answer I assume you want to check rows 1, 2, and 3. If you only want to check rows 1 and 2 in that example, change duration[i] below to duration[i] - 1.
Function explanation:
Get the indices of all rows with low temps (negtemp = which(temp < 0)). Start with the first index (negtemp[j] with j = 1). For the first row of df, check whether the end of your interval is at or after that index. If it is, return TRUE for that row because it has a low temp in the interval. Now do the same for the next row, and so on. If at any point the row you're on is after negtemp[j], move to the next index by setting j <- j + 1. The exception is when you're already at the last element of negtemp: in that case don't change j and just set out[i] to FALSE for that row and all the remaining rows.
This way, you're looping through all the rows only once and for each row only doing a few additions and comparisons (vs creating a new temp vector at each row).
set.seed(1)
df = data.frame(date = seq(as.Date("2020-01-01"),as.Date("2020-04-09"), by=1),
temp = runif(100,min=-2,max=30),
duration = sample.int(20,100,replace=TRUE))
library(dplyr, warn.conflicts = FALSE)
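check_lowtemp = function(temp, duration){
  negtemp = which(temp < 0)                # row indices of all sub-zero temps
  out = logical(length(temp))
  if (length(negtemp) == 0) return(out)    # no sub-zero temps: every row is FALSE
  j = 1
  for (i in seq_along(temp)) {
    # once row i has passed negtemp[j], move on to the next sub-zero index (if any)
    if (i > negtemp[j]) j = min(length(negtemp), j + 1)
    # TRUE when the next sub-zero row lies inside rows i .. i + duration[i]
    out[i] = i <= negtemp[j] && (i + duration[i]) >= negtemp[j]
  }
  out
}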
df %>%
mutate(lowtemp = check_lowtemp(temp, duration)) %>%
head
#> date temp duration lowtemp
#> 1 2020-01-01 6.496277 12 TRUE
#> 2 2020-01-02 9.907965 16 TRUE
#> 3 2020-01-03 16.331308 1 FALSE
#> 4 2020-01-04 27.062649 13 TRUE
#> 5 2020-01-05 4.453822 6 TRUE
#> 6 2020-01-06 26.748470 17 TRUE
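The same bookkeeping can probably be done without an explicit loop. A minimal sketch, assuming the window is rows i to i + duration[i] as above and that temp has no missing values; next_neg_index is a hypothetical helper name, not from any package. It finds, for every row, the index of the next sub-zero row at or after it, then compares that index with the end of the window.

```r
# Hypothetical helper: for each row, index of the next sub-zero temp at or after it
# (Inf when there is none). Assumes temp contains no NA values.
next_neg_index <- function(temp) {
  idx <- ifelse(temp < 0, seq_along(temp), Inf)  # own index if sub-zero, else Inf
  rev(cummin(rev(idx)))                          # running minimum taken from the end
}

temp     <- c(5, 3, -1, 4, 2, -2, 7)
duration <- c(1, 1, 2, 1, 1, 0, 3)

# A row is TRUE when the next sub-zero row falls inside rows i .. i + duration[i]
lowtemp <- next_neg_index(temp) <= seq_along(temp) + duration
lowtemp
#> [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```

I have not benchmarked this variant, so treat it as an alternative formulation of the idea and not as a claim about speed.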
Benchmark:
library(bench)
library(zoo)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
df = df %>% slice(sample(row_number(), 5e5, TRUE))
mark(
a = {
df %>%
mutate(answer = rollapply(
temp < 0,
width = duration,
FUN = any,
align = "left",
fill = NA))
}
, b = {
df %>%
mutate(lowtemp = check_lowtemp(temp, duration))
}
, check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 a 6.9s 6.9s 0.145 238.01MB 7.53
#> 2 b 62.9ms 69.7ms 13.4 5.82MB 1.92
Created on 2022-06-27 by the reprex package (v2.0.1)