Extract all subsets in vector where elements are above a given threshold

Time:11-30

I would like to know if there is an R way (one liner) to extract the coordinates of all subsets of a vector that are above a given threshold. Suppose I have the following data:

v =  c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)

and let's say I have threshold = 0.7. The desired output would be:

left    right
1       4
15      23
26      29
31      32

I can in principle write a while loop of some sort, subsetting v and juggling the left and right coordinates of these regions, something like:

left = which(subset >= threshold)[1] + right
right = which(subset[left:length(subset)] < threshold)[1] - 1 # -1 to get the last element above the threshold

subset = v[(right + 1):length(v)]

(not tested), but I am sure there is an R way that I can't seem to remember.

I had a look here but it's not really what I am after. Any help is appreciated.

CodePudding user response:

You can use rle() to find the runs of values that exceed your threshold. Then you can turn that into your desired format:

rle(v>.7) |>
  with(
    data.frame(start=1, end=cumsum(lengths)) |> 
      transform(start=c(1, head(end, -1) + 1)) |> 
      subset(values)
  )

And that returns

  start end
1     1   4
3    15  23
5    26  29
7    31  32

This is nearly identical to this existing question with the main difference of using rle() on your Boolean condition and then subsetting to only the TRUE values.
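
To see what the pipeline is doing, it may help to spell out the intermediate pieces. The following is only a sketch of the same rle() idea, assuming v has no NA values; the helper name runs_above is made up for illustration and is not part of base R:

```r
v <- c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)

# Hypothetical helper: first and last index of every run where x > threshold.
# Assumes x contains no NA values.
runs_above <- function(x, threshold) {
  r <- rle(x > threshold)          # runs of consecutive TRUE / FALSE
  end <- cumsum(r$lengths)         # last index of each run
  start <- end - r$lengths + 1     # first index of each run
  # keep only the runs where the condition is TRUE
  data.frame(start = start, end = end)[r$values, ]
}

res <- runs_above(v, 0.7)
res
```

The row names of the result (1, 3, 5, 7) are the positions of the TRUE runs among all runs, which is why they skip every other number. Use x >= threshold inside the helper if values equal to the threshold should count as above it.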

CodePudding user response:

Same solution but using data.table

v =  c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)

library(data.table)

data.table(v)[, .(start = .I[1], end = .I[.N], keep = unique(v > 0.7)), by = rleid(v > 0.7)][keep == TRUE, .(start, end)]

#    start end
# 1:     1   4
# 2:    15  23
# 3:    26  29
# 4:    31  32
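
For comparison, the same start/end pairs can probably also be found without rle(), by looking at where the condition switches on and off. This is not taken from either answer above; it is a base R sketch using diff(), again assuming v has no NA values:

```r
v <- c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)

above <- v > 0.7

# Pad with FALSE on both sides so runs touching either end are detected.
# d is +1 at the index where a run starts and -1 one past where it ends.
d <- diff(c(FALSE, above, FALSE))

res_diff <- data.frame(start = which(d == 1), end = which(d == -1) - 1)
res_diff
```

Every start has a matching end because of the padding, so the two which() calls always return vectors of the same length.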