R remove values that do not fit into a sequence-CodePudding

I have a sequence s where I expect each proceeding value to be either the same as the previous one or 1.

s = c(1,1,1,1,2,2,2,-2,3,3,4,8,8,8,9,5,5,12,6)

What I want:

1,1,1,1,2,2,2,3,3,4,5,5,6

I've solved this with the following code:

counter = 2
repeat{
  
  if(s[counter] == s[counter-1] | s[counter] == s[counter-1] 1){
    counter = counter 1
  } else{
    s = s[-counter]
  }
  
  if(counter >= length(s)) break
}

which however appears quite 'dirty' and inefficient. Is there a computationally less time-consuming solution?

CodePudding user response：

I can't see an easy vector based solution, but a normal for loop with preallocation could help here

s = c(1,1,1,1,2,2,2,-2,3,3,4,8,8,8,9,5,5,12,6)
increasing_seq <- function(x) {
  keep <- logical(length(x))
  current <- x[1]
  for (i in seq_along(x)) {
    if (x[i] == current) {
      keep[i] <- TRUE
    } else if (x[i] == current   1) {
      current <- current   1
      keep[i] <- TRUE
    }
  }
  x[keep]
}
increasing_seq(s)
# [1] 1 1 1 1 2 2 2 3 3 4 5 5 6

Here we avoid recreating the s vector of different sizes. It's normally re-allocation that's slow, not looping.

CodePudding user response：

This can be done with Reduce:

Reduce(function(prev, this) 
  c(prev, if (any(this %in% (prev[length(prev)]   0:1))) this),
  s)
#  [1] 1 1 1 1 2 2 2 3 3 4 5 5 6

This can't be vectorized since the calc on one position relies on the results of the previous calculations, but this is compact and (imo) readable.

FYI, if performance is your primary metric, then it appears sindri_baldur's Rcpp and MrFlick's current answers win out by a landslide:

bench::mark(
sindri_baldur = {
  keep      = vector(length = length(s))
  keep[1]   = TRUE
  last_keep = 1L
  for (counter in 2:length(s)) {
    if ((s[counter] - s[last_keep]) %in% c(0, 1)) {
      last_keep = counter
      keep[counter] = TRUE
    }
  }
  s[keep]
},
sindri_baldur_rcpp = s[foo(s)],
r2evans = {
  Reduce(function(prev, this) 
    c(prev, if (any(this %in% (prev[length(prev)]   0:1))) this),
    s)
},
MrFlick = increasing_seq(s))
# # A tibble: 4 x 13
#   expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time     gc       
#   <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>   <list>   
# 1 sindri_baldur        2.62ms   3.05ms      317.   37.81KB     15.5   143     7    450.5ms <dbl [13]> <Rprofmem~ <bench_~ <tibble ~
# 2 sindri_baldur_rcpp      1us    1.4us   533120.    2.49KB      0   10000     0     18.8ms <dbl [13]> <Rprofmem~ <bench_~ <tibble ~
# 3 r2evans              28.5us   36.9us    23454.   24.67KB     18.8  9992     8      426ms <dbl [13]> <Rprofmem~ <bench_~ <tibble ~
# 4 MrFlick               2.1us    2.5us   345770.        0B      0   10000     0     28.9ms <dbl [13]> <Rprofmem~ <bench_~ <tibble ~

CodePudding user response：

Edit

R concept translated to Rcpp:

Rcpp::cppFunction('LogicalVector foo(NumericVector s) {
  int n = s.size();
  Rcpp::LogicalVector keep(n);
  keep[0]  = 1;
  int last = 0;
  for (int i = 1; i < n; i  ) {
    if        (s[i] - s[last] == 0) {
      keep[i] = 1;
    } else if (s[i] - s[last] == 1) {
      keep[i] = 1;
      last = i;
    } 
  }
  return keep;
}')


s[foo(s)]
# [1] 1 1 1 1 2 2 2 3 3 4 5 5 6

Original solution (very similiar but inferior to MrFlick's):

Another slightly more efficient R loop. However, if efficiency is important Rcpp might be a good choice.

keep      = vector(length = length(s))
keep[1]   = TRUE
last_keep = 1L
for (counter in 2:length(s)) {
  if ((s[counter] - s[last_keep]) %in% c(0, 1)) {
    last_keep = counter
    keep[counter] = TRUE
  }
}
s[keep]

# [1] 1 1 1 1 2 2 2 3 3 4 5 5 6