R: Is it possible to optimize the following function?

I am working with the R programming language.

I have the following data:

library("dplyr")

df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))

a<- as.factor(a)
df$a = a

> head(df)
           b          d          c a
1  3.1316480  0.5032860  4.7362991 a
2  4.3111450 -0.1142736 -0.5841322 c
3  2.8291346  3.6107839 16.0684492 a
4 14.2142245  4.9893987 -1.8145138 a
5 -6.7381302  0.0416782 -7.7675387 c
6  0.4481874  0.3370716 17.4260801 a

I also have the following function (my_subset_mean) which evaluates the mean of the "column c" given a specific choice of inputs:

my_subset_mean <- function(r1, r2, r3){
    subset <- df %>% filter(a %in% r1, b > r2, d < r3)
    return(mean(subset$c))
}

my_subset_mean(r1 = c("a", "b"), r2 = 5, r3 = 1)
[1] 5.682513

Question: Using the GA library in R, I am trying to optimize the my_subset_mean function (a mixed-integer problem), subject to the following constraints:

"r1" can take any combination of ("a", "b", "c", "d", "e"), e.g. "a", "a, c", "b, d, e", "a, b, c, d, e", "e, a", etc.
"r2" can take any value between 0 and 1
"r3" can take any value between 0 and 1

However, my_subset_mean can also be calculated with unspecified values of "r1", "r2" or "r3", for example:

my_subset_mean(r1 = c("a", "b"), r2 = 5, r3 = NA)
my_subset_mean(r1 = NA,  r2 = 5, r3 = NA )

etc.

I tried to perform this optimization with the GA library:

library(GA)

GA <- ga(type = "real-valued", 
         fitness = function(x)  my_subset_mean(x[1], x[2], x[3]),
         lower = c(c("a", "b", "c", "d"), 1, 1), upper = c(c("a", "b", "c", "d"), 100, 100), 
         popSize = 50, maxiter = 1000, run = 100)

But I don't think this is the correct way to do it.

Thanks

What I tried in the past:

In a previous question (R: Adding "NA" factors to the "levels" function ), I learned how to optimize a similar function using "random grid search":

my_subset_mean <- function(r1=NA, r2=NA, r3=NA, r4 = NA) {  
  if (all(is.na(r1))) r1 <- unique(df$a)
  if (all(is.na(r4))) r4 <- unique(df$f)
  if (is.na(r2)) r2 <- -Inf
  if (is.na(r3)) r3 <- Inf
  s <- filter(df, a %in% r1 , f %in% r4, b > r2 , d < r3)
  return(mean(s$c))
}

create_output <- function() {
  uv <- levels(df$a)
  r1 <- sample(list(sample(uv, sample(length(uv))), NA), 1)[[1]]
  uv1 <- levels(df$f)
  r4 <-  sample(list(sample(uv1, sample(length(uv1))), NA), 1)[[1]]
  rgb <- range(df$b)
  rgd <- range(df$d)
  r2 <- sample(c(runif(1, rgb[1], rgb[2]), NA), 1)
  r3 <- sample(c(runif(1, rgd[1], rgd[2]), NA), 1)
  my_subset_mean <- my_subset_mean(r1, r2, r3, r4)
  data.frame(r1 = toString(r1), r4 = toString(r4), r2, r3, my_subset_mean)
}

set.seed(123)
out <- do.call(rbind, replicate(100, create_output(), simplify = FALSE))
head(out)

#            r1         r4        r2        r3 my_subset_mean
#1            NA          c        NA 4.2164973      12.095431
#2 a, b, c, d, e    b, a, c        NA 0.4394423       7.130999
#3            NA a, c, e, b  9.285701        NA       8.236054
#4            NA         NA 14.060829 3.8960888      10.562523
#5    c, b, a, d         NA        NA        NA       9.015613
#6            NA    a, c, d  2.251218        NA      10.070425

But can someone please show me how to do this with the "GA" function in R?

Thanks

CodePudding user response：

The reason a local-search algorithm can handle such problems is that solutions are only "touched" by two functions, both of which you have to supply. The first is the objective function.

I have slightly rewritten yours:

my_subset_mean <- function(x){  
    subset <- df %>% filter(a %in% names(x$r1)[x$r1],
                            b > x$r2,
                            d < x$r3)
    ans <- -mean(subset$c)
    if (!is.finite(ans))
        ans <- 100
    ans
}

Instead of three arguments, it only takes one: a list of your original arguments. Also, I assume you want to maximize, so I put a minus in front of the mean. (The algorithm I am going to use later minimizes by default.) If a mean is not finite (NA, NaN), I simply return a large value as a marker for a "bad" solution. Just adjust this to your needs.
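To see what the sign flip and the non-finite guard do, here is a small sketch in base R. The data frame toy, the function name objective and its penalty argument are made up for illustration; they are not part of the code above.

```r
## Toy data, invented purely for illustration
toy <- data.frame(a = c("a", "b", "c"),
                  b = c(1, 2, 3),
                  d = c(0.1, 0.2, 0.3),
                  c = c(10, 20, 30))

## Same logic as my_subset_mean, but in base R and with the
## data and the penalty passed in explicitly (both assumptions)
objective <- function(x, data, penalty = 100) {
    keep <- data$a %in% names(x$r1)[x$r1] &
            data$b > x$r2 &
            data$d < x$r3
    ans <- -mean(data$c[keep])      ## minus: the optimiser minimises
    if (!is.finite(ans))            ## empty subset gives NaN
        ans <- penalty
    ans
}

sel <- c(a = TRUE, b = TRUE, c = FALSE)

## rows 1 and 2 pass the filter: -(10 + 20)/2
val_ok <- objective(list(r1 = sel, r2 = 0.5, r3 = 0.5), toy)

## no row has b > 5, so the mean is NaN and the penalty is returned
val_empty <- objective(list(r1 = sel, r2 = 5, r3 = 0.5), toy)
```

The penalty only works as a marker for a "bad" solution if it is worse than every value a valid subset can produce; with a negated mean, 100 is safe as long as no subset has a mean below -100.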

Start with an arbitrary but valid solution.

tmp <- !logical(length(sort(unique(a))))
names(tmp) <- sort(unique(a))

x <- list(r1 = tmp,
          r2 = 0.5,
          r3 = 0.5)

x
## $r1
##    a    b    c    d    e 
## TRUE TRUE TRUE TRUE TRUE 
## 
## $r2
## [1] 0.5
## 
## $r3
## [1] 0.5

I recreate your data. (I don't use factors but strings.)

library("dplyr")
df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))
df$a <- a

Evaluate x:

my_subset_mean(x)
## [1] -11.34132

Of course, this result depends on random data. Your numbers will differ.

Now, the second function: the neighbourhood. It takes a solution and returns a slightly modified version of it. Again, since you have to provide this function, you have complete control and hence any data structures can serve as input. Here is an example.

nb <- function(x) {
    i <- sample(c("r1", "r2", "r3"), 1)
    if (i == "r1") {
        j <- sample(length(x[[i]]), 1)
        x[[i]][j] <- !x[[i]][j]        
    } else {
        x[[i]] <- x[[i]] + runif(1, min = -0.1, max = 0.1)
        x[[i]] <- max(min(1, x[[i]]), 0)        
    }
    x
}

x  ## original solution
## $r1
##    a    b    c    d    e 
## TRUE TRUE TRUE TRUE TRUE 
## 
## $r2
## [1] 0.5
## 
## $r3
## [1] 0.5

nb(x)   ## ... and a neighbour
## $r1
##    a    b    c    d    e 
## TRUE TRUE TRUE TRUE TRUE 
## 
## $r2
## [1] 0.5
## 
## $r3
## [1] 0.42586

nb(x)   ## ... and another neighbour
## $r1
##     a     b     c     d     e 
##  TRUE FALSE  TRUE  TRUE  TRUE 
## 
## $r2
## [1] 0.5
## 
## $r3
## [1] 0.5
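Note that the bounds on r2 and r3 live entirely in the neighbourhood function: the max(min(...)) line clamps every new value to the interval [0, 1]. Here is a minimal sketch of how different bounds per variable could be made explicit, say r2 in [0, 2] and r3 in [0, 1.5]. The name nb_bounded and its arguments lower, upper and step are hypothetical.

```r
## Hypothetical variant of nb: bounds and step size are arguments
nb_bounded <- function(x,
                       lower = c(r2 = 0, r3 = 0),
                       upper = c(r2 = 2, r3 = 1.5),
                       step = 0.1) {
    i <- sample(c("r1", "r2", "r3"), 1)
    if (i == "r1") {
        ## flip one randomly chosen category in or out
        j <- sample(length(x[[i]]), 1)
        x[[i]][j] <- !x[[i]][j]
    } else {
        ## small random move, then clamp to [lower, upper]
        x[[i]] <- x[[i]] + runif(1, min = -step, max = step)
        x[[i]] <- max(min(upper[[i]], x[[i]]), lower[[i]])
    }
    x
}

## Apply it repeatedly to a starting solution; y never leaves the bounds
set.seed(1)
y <- list(r1 = c(a = TRUE, b = TRUE, c = TRUE, d = TRUE, e = TRUE),
          r2 = 0.5,
          r3 = 0.5)
for (k in 1:1000)
    y <- nb_bounded(y)
```

Since all extra arguments have defaults, nb_bounded could be passed as the neighbour in place of nb. The starting solution should itself lie within the bounds.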

And that's it. With these two functions (objective and neighbourhood), you can run the actual algorithm. Here, I use Threshold Accepting.

library("NMOF")
ans <- TAopt(my_subset_mean, list(x0 = x, neighbour = nb, nI = 1000))

-my_subset_mean(ans$xbest)
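To illustrate why only an objective function and a neighbourhood function are needed, here is a bare-bones sketch of the idea behind Threshold Accepting. It is not NMOF's implementation: the function ta_sketch, the fixed threshold sequence and the toy problem are all assumptions made for this example.

```r
## A minimal Threshold Accepting loop (illustrative only)
ta_sketch <- function(OF, x0, neighbour,
                      thresholds = c(0.5, 0.2, 0.1, 0),
                      n_steps = 250) {
    xc <- x0;     fc <- OF(xc)      ## current solution
    xbest <- xc;  fbest <- fc       ## best solution seen so far
    for (tau in thresholds) {       ## thresholds shrink towards zero
        for (s in seq_len(n_steps)) {
            xn <- neighbour(xc)
            fn <- OF(xn)
            ## accept if the new solution is not worse than
            ## the current one by more than tau
            if (fn - fc <= tau) {
                xc <- xn;  fc <- fn
                if (fc < fbest) {
                    xbest <- xc;  fbest <- fc
                }
            }
        }
    }
    list(xbest = xbest, OFvalue = fbest)
}

## Toy problem: minimise (x - 3)^2 over the reals
set.seed(42)
res <- ta_sketch(OF = function(x) (x - 3)^2,
                 x0 = 0,
                 neighbour = function(x) x + runif(1, min = -0.5, max = 0.5))
res$xbest   ## should end up close to 3
```

The loop never looks inside a solution; it only calls OF and neighbour. That is why a list holding a logical vector and two numbers works just as well as a single number.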

I hope this gets you started with TAopt. For more on local-search methods, see this tutorial. Since you apparently want to filter a data-frame, perhaps this answer is also helpful: Finding ideal filter setting to maximize target function . Disclosure: I am the maintainer of package NMOF.

CodePudding user response：

(Not an answer)

@Enrico Schumann: Just to make sure I understood your answer correctly, I added an extra categorical variable (column "e", filtered via "r4") to the data and then ran the optimization. It seemed to work, but could you please take a look if you have time?

### create data
df <- data.frame(b = rnorm(100,5,5), d = rnorm(100,2,2),
                 c = rnorm(100,10,10))

a <- c("a", "b", "c", "d", "e")
a <- sample(a, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))
df$a <- a

e <- c("a", "b", "c", "d", "e")
e <- sample(e, 100, replace=TRUE, prob=c(0.3, 0.2, 0.3, 0.1, 0.1))
df$e <- e


#create function to be optimized
my_subset_mean <- function(x){  
    subset <- df %>% filter(a %in% names(x$r1)[x$r1], e %in% names(x$r4)[x$r4],
                            b > x$r2,
                            d < x$r3)
    ans <- -mean(subset$c)
    if (!is.finite(ans))
        ans <- 100
    ans
}


#store values of categorical variables into temporary objects

tmp <- !logical(length(sort(unique(a))))
names(tmp) <- sort(unique(a))

tmp1 <- !logical(length(sort(unique(e))))
names(tmp1) <- sort(unique(e))

x <- list(r1 = tmp, r4 = tmp1,
          r2 = 0.5,
          r3 = 0.5)




### optimization
nb <- function(x) {
    i <- sample(c("r1", "r2", "r3", "r4"), 1)
    if (i == "r1" || i == "r4") {
        j <- sample(length(x[[i]]), 1)
        x[[i]][j] <- !x[[i]][j]        
    } else {
        x[[i]] <- x[[i]] + runif(1, min = -0.1, max = 0.1)
        x[[i]] <- max(min(1, x[[i]]), 0)        
    }
    x
}


library("NMOF")
ans <- TAopt(my_subset_mean, list(x0 = x, neighbour = nb, nI = 1000))

-my_subset_mean(ans$xbest)

I am still a bit confused: in this algorithm, where would you specify the upper and lower bounds for r2 and r3? For example, if I want r2 to lie in (0, 2) and r3 in (0, 1.5), where exactly can I specify this?

Thanks!