Finding consecutive values beneath a threshold-CodePudding

I am trying to write some code to define some boundary layers.

The data is arranged in the table like this (but there are many more rows):

Depth (um)	Replicate 1 (O2 Sat %)	Replicate 2 (O2 Sat %)	Replicate 3 (O2 Sat %)
0	10	11	11
-100	11	11	12
-200	13	12	11
-300	14	13	14
-400	15	15	15
-500	16	16	16

For each of these replicates I want to find the size of the boundary layer. I am defining the boundary layer as the distance above the surface (0um) at which the changes are <5% per 100um for for subsequent measurements. So I need to find the depth of the first row that results in this definition not being met. I also need it to make sure that the function is using rows 1-4 then 2-5 then 3-6 and so on not just moving down the rows 1-4,5-8 etc etc so that I can identify the first time this boundary layer definition is not met. I would like to detect this change for each replicate.

I have tried looking for some ways to approach it but I am not sure I am asking the correct questions because I am not exactly sure which type of functions to start with. I am assuming this may be some type of threshold or cutoff type function but I thought I would get some ideas on how to proceed as my searching was not getting me anywhere.

I appreciate any advice or ideas on how to get started on this. Thank you for your time in advance.

CodePudding user response：

I think something like this can be accomplished with diff and a moving sum (via stats::filter):

Make some example data:

dat <- read.table(text="
Depth   Replicate1      Replicate2 Replicate3
0       10      11      11
-100    11      11      12
-200    13      12      11
-300    14      13      14
-400    15      15      15
-500    16      16      16", header=TRUE)

Set some variables:

cutoff      <- 0.05
windowsize  <- 4

Calculate the percentage difference row to row:

percdiff    <- diff(as.matrix(dat[-1])) / dat[-nrow(dat), -1]
percdiff
#  Replicate1 Replicate2  Replicate3
#1 0.10000000 0.00000000  0.09090909
#2 0.18181818 0.09090909 -0.08333333
#3 0.07692308 0.08333333  0.27272727
#4 0.07142857 0.15384615  0.07142857
#5 0.06666667 0.06666667  0.06666667

Check the percentage difference is above 5% at each row:

percdiff_co <- percdiff > cutoff
percdiff_co
#  Replicate1 Replicate2 Replicate3
#1       TRUE      FALSE       TRUE
#2       TRUE       TRUE      FALSE
#3       TRUE       TRUE       TRUE
#4       TRUE       TRUE       TRUE
#5       TRUE       TRUE       TRUE

Calculate a moving sum for each replicate in the 4-observation window:

out <- stats::filter(percdiff_co, rep(1,windowsize), sides=1) 
out
#Time Series:
#Start = 1 
#End = 5 
#Frequency = 1 
#  [,1] [,2] [,3]
#1   NA   NA   NA
#2   NA   NA   NA
#3   NA   NA   NA
#4    4    3    3
#5    4    4    3

Check if the moving sum of greater than the cutoff is always TRUE:

out <- out == windowsize
tail(out, -(windowsize-1))
#     [,1]  [,2]  [,3]
#[1,] TRUE FALSE FALSE
#[2,] TRUE  TRUE FALSE

CodePudding user response：

Here is a potential solution using dplyr functions, but I don't understand what you mean by

I also need it to make sure that the function is using rows 1-4 then 2-5 then 3-6 and so on not just moving down the rows 1-4,5-8 etc etc so that I can identify the first time this boundary layer definition is not met.

so, this may not properly answer your question:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(
  check.names = FALSE,
  `Depth (um)` = c(0L, -100L, -200L, -300L, -400L, -500L),
  `Replicate 1 (O2 Sat %)` = c(10L, 11L, 13L, 14L, 15L, 16L),
  `Replicate 2 (O2 Sat %)` = c(11L, 11L, 12L, 13L, 15L, 16L),
  `Replicate 3 (O2 Sat %)` = c(11L, 12L, 11L, 14L, 15L, 16L)
)

# To 'replace' the columns:
df %>%
  mutate(across(starts_with("Replicate"), 
                ~as.logical(cummax(abs(.x / lag(.x, default = first(.x)) - 1) > 0.05))))
#>   Depth (um) Replicate 1 (O2 Sat %) Replicate 2 (O2 Sat %)
#> 1          0                  FALSE                  FALSE
#> 2       -100                   TRUE                  FALSE
#> 3       -200                   TRUE                   TRUE
#> 4       -300                   TRUE                   TRUE
#> 5       -400                   TRUE                   TRUE
#> 6       -500                   TRUE                   TRUE
#>   Replicate 3 (O2 Sat %)
#> 1                  FALSE
#> 2                   TRUE
#> 3                   TRUE
#> 4                   TRUE
#> 5                   TRUE
#> 6                   TRUE

# To add the columns to the existing dataframe:
df %>%
  mutate(across(starts_with("Replicate"), 
                ~as.logical(cummax(abs(.x / lag(.x, default = first(.x)) - 1) > 0.05)),
                .names = "{paste0(.col, ' <5% change')}")) %>%
  select(1, 2, 5, 3, 6, 4, 7)
#>   Depth (um) Replicate 1 (O2 Sat %) Replicate 1 (O2 Sat %) <5% change
#> 1          0                     10                             FALSE
#> 2       -100                     11                              TRUE
#> 3       -200                     13                              TRUE
#> 4       -300                     14                              TRUE
#> 5       -400                     15                              TRUE
#> 6       -500                     16                              TRUE
#>   Replicate 2 (O2 Sat %) Replicate 2 (O2 Sat %) <5% change
#> 1                     11                             FALSE
#> 2                     11                             FALSE
#> 3                     12                              TRUE
#> 4                     13                              TRUE
#> 5                     15                              TRUE
#> 6                     16                              TRUE
#>   Replicate 3 (O2 Sat %) Replicate 3 (O2 Sat %) <5% change
#> 1                     11                             FALSE
#> 2                     12                              TRUE
#> 3                     11                              TRUE
#> 4                     14                              TRUE
#> 5                     15                              TRUE
#> 6                     16                              TRUE

^{Created on 2022-07-25 by the reprex package (v2.0.1)}