Faster ways to generate Yellowstone sequence (A098550) in R?-CodePudding

I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's base on a sequence starting with 1 and 2, with subsequent terms generated by the rules:

no repeated terms
always pick the lowest integer
gcd(a_n, a_(n-1)) = 1
gcd(a_n, a_(n-2)) > 1

The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7

A Q&D approach in R could be something like this, but understandably, this becomes very slow at attempts to make longer sequences. It also make some assumptions about the highest number that is possible within the sequence (as info: the sequence of 10,000 items never goes higher than 5000).

What can we do to make this faster?

library(DescTools)
a <- c(1, 2, 3)
p <- length(a)

# all natural numbers
all_ints <- 1:5000

for (n in p:1000) {
    # rule 1 - remove all number that are in sequence already
    next_a_set <- all_ints[which(!all_ints %in% a)]

    # rule 3 - search the remaining set for numbers that have gcd == 1
    next_a_option <- next_a_set[which(
        sapply(
            next_a_set,
            function(x) GCD(a[n], x)
        ) == 1
    )]

    # rule 4 - search the remaining number for gcd > 1
    next_a <- next_a_option[which(
        sapply(
            next_a_option,
            function(x) GCD(a[n - 1], x)
        ) > 1
    )]

    # select the lowest
    a <- c(a, min(next_a))
    n <- n   1
}

CodePudding user response：

Here's a version that's about 20 times faster than yours, with comments about the changes:

# Set a to the final length from the start.
a <- c(1, 2, 3, rep(NA, 997))
p <- 3

# Define a vectorized gcd() function.  We'll be testing
# lots of gcds at once.  This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd

  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}

# Guess at a reasonably large vector to work from,
# but we'll grow it later if not big enough.
allnum <- 1:1000

# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))

for (n in p:1000) {
  # rule 1 - remove all number that are in sequence already
  # nothing to do -- used already records that.
  
  repeat {
    # rule 3 - search the remaining set for numbers that have gcd == 1

    keep <- !used & gcd(a[n], allnum) == 1

    # rule 4 - search the remaining number for gcd > 1
    keep <- keep & gcd(a[n-1], allnum) > 1
  
    # If we found anything, break out of this loop
    if (any(keep))
      break
    
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2*length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # select the lowest
  newval <- which.max(keep)
  # Assign into the appropriate place
  a[n 1] <- newval
  # Record that it has been used
  used[newval] <- TRUE  
}

If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C .

CodePudding user response：

The biggest change here is pre-allocation and restricting the search to numbers that have not yet been used.

library(numbers)
N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N

system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    for (k in NN[!b]) {
      if (GCD(k, a1) == 1L & GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    if (!a[n]) {
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#>    user  system elapsed 
#>    1.28    0.00    1.28
length(a)
#> [1] 1137

For a fast C algorithm, see here.