Simulated Annealing - Intuition


Cross-posted from csexchange:


Most versions of simulated annealing I've seen are implemented similarly to what is outlined in the Wikipedia pseudocode below:

Let s = s0
For k = 0 through kmax (exclusive):
  T ← temperature( 1 - (k+1)/kmax )
  Pick a random neighbour, snew ← neighbour(s)
  If P(E(s), E(snew), T) ≥ random(0, 1):
    s ← snew
Output: the final state s
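
For concreteness, here is a minimal runnable Python sketch of that loop, assuming a minimization problem; the names energy, neighbour and temperature mirror the pseudocode, and the Metropolis rule used for acceptance is the usual choice for P, not something the pseudocode mandates:

import math
import random

def simulated_annealing(s0, energy, neighbour, temperature, k_max):
    # Direct transcription of the pseudocode above, for minimization.
    s = s0
    for k in range(k_max):
        t = temperature(1 - (k + 1) / k_max)
        s_new = neighbour(s)
        delta = energy(s_new) - energy(s)
        # Metropolis acceptance: improvements are always taken;
        # worsening moves are taken with probability exp(-delta / t),
        # which shrinks as t cools.
        if delta <= 0 or (t > 0 and random.random() < math.exp(-delta / t)):
            s = s_new
    return s

# Example linear cooling schedule (illustrative values):
# temperature = lambda frac: 10.0 * frac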

I am having trouble understanding how this algorithm avoids getting stuck in a local optimum as the temperature cools. If we jump around at the start while the temperature is high, and eventually only take uphill moves as it cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temperature was high, and then be left in a worse position as the temperature cools and we transition to hill climbing.

An often-listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temperature is high, but I don't see how this is any better than simply performing repeated random hill climbing to sample the space, without the temperature theatrics.
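
For illustration, that modification is only a few extra lines around the same loop; a sketch reusing the imports and hypothetical names from above:

def simulated_annealing_with_best(s0, energy, neighbour, temperature, k_max):
    # Same loop as the sketch above, but remember the best state
    # ever visited instead of returning wherever the walk ends.
    s = best = s0
    best_e = energy(s0)
    for k in range(k_max):
        t = temperature(1 - (k + 1) / k_max)
        s_new = neighbour(s)
        delta = energy(s_new) - energy(s)
        if delta <= 0 or (t > 0 and random.random() < math.exp(-delta / t)):
            s = s_new
            e = energy(s)
            if e < best_e:
                best, best_e = s, e
    return best  # best seen, not necessarily the final state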

Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.
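
A hedged sketch of that combined idea follows; every name here (beam_anneal, temps, steps_per_temp) is hypothetical, and temps is assumed to be a decreasing list of positive temperatures:

def beam_anneal(seeds, energy, neighbour, temps, n, steps_per_temp):
    # At each temperature level, anneal from each current candidate
    # and keep the n best states seen as the beam for the next level.
    candidates = list(seeds)
    for t in temps:
        seen = list(candidates)
        for start in candidates:
            s = start
            for _ in range(steps_per_temp):
                s_new = neighbour(s)
                delta = energy(s_new) - energy(s)
                if delta <= 0 or random.random() < math.exp(-delta / t):
                    s = s_new
                seen.append(s)
        candidates = sorted(seen, key=energy)[:n]
    return min(candidates, key=energy)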

CodePudding user response:

By my understanding, simulated annealing is not guaranteed to avoid getting stuck in a local maximum (for maximization problems), especially as it "cools" late in the cycle, as k -> kmax.

So as you say, we "jump all over the place" at the start, but we are still selective about whether or not we accept each jump, because the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same Wikipedia article, they describe the P() function a bit further, and suggest that if e_new > e, then perhaps P = 1: always take the move if it is an improvement, but perhaps not always if it isn't. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a maximum (or minimum), which may or may not be the global one.
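
A sketch of such a P(), using the maximization framing of this answer; the exponential form is the common Metropolis-style choice, one assumption among several the article allows:

import math

def P(e, e_new, t):
    # Always accept an improvement; otherwise accept with a
    # probability that shrinks both with how much worse the
    # candidate is and with how far the temperature has cooled.
    if e_new > e:
        return 1.0
    return math.exp((e_new - e) / t) if t > 0 else 0.0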

CodePudding user response:

The challenge in global optimization is to obtain good efficiency, i.e. to find the optimal solution, or one close to the optimum, with as few function evaluations as possible. So many "we could perform..." reasonings are correct in terms of convergence to a good solution, but may fail to be efficient.

Now the main difference between pure hill climbing and annealing is that annealing temporarily admits a worsening of the objective precisely to avoid getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds a "tunneling" effect.
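
In code, the whole difference is one condition; a sketch in the minimization framing of the question's pseudocode, with delta = E(s_new) - E(s) and reusing math and random from the first sketch:

def accept_hill_climb(delta, t):
    # Pure hill climbing: only ever move to an equal-or-better state.
    return delta <= 0

def accept_anneal(delta, t):
    # Annealing: additionally "tunnel" through worsening moves, with a
    # probability that decays as the temperature t falls.
    return delta <= 0 or (t > 0 and random.random() < math.exp(-delta / t))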

Temperature decrease is not the key ingredient here, and by the way, hill-climbing methods such as the Nelder-Mead simplex or Hooke-Jeeves pattern search also work with diminishing perturbations. Temperature schemes are more related to a multiscale decomposition of the target function.
