I'm trying to implement steepest descent for a function of two variables. It works fine with a fixed step size of 0.3, but I want to optimize the step size and write a function that finds a good one. I found something called the Armijo–Goldstein condition, but I didn't understand it and the formula was confusing to me. So I'm asking for your help: do you have an idea how to balance this? I think everything connected to the step size in my code is wrong. I guess it has to calculate the step size depending on the x and y we're at.
x, y = f.optimal_range()  ## getting a random start
step = 0.3  ## <--- This number has to be random between 0 and 1. But my step size calculations are wrong, so I can't do it.
while f.fprime_x(x) != 0:  ## loop to find the zero of the function's derivative with respect to x
    fx = -f.fprime_x(x)
    x = x + (step * fx)
    print(x)
    if not f.delta_check(step, x, y):  ## <--- Here's the problem. By definition the step has to be smaller if it doesn't pass the check, but if I make it smaller, it enters the eternal loop around the mi
        step = step * 1.001
while f.fprime_y(y) != 0:  ## loop to find the zero of the function's derivative with respect to y
    fy = -f.fprime_y(y)
    y = y + (step * fy)
    print(x, y)
    if not f.delta_check(step, x, y):
        step = step * 1.001
print(f"\n\nMIN ({x}, {y})")
Here's the function that checks the delta / step size:
def delta_check(delta, x, y):
    ux = -fprime_x(x)
    uy = -fprime_y(y)
    f_del_xy = func(x + (delta * ux), y + (delta * uy))
    return f_del_xy <= func(delta * ux, delta * uy) + delta
CodePudding user response:
Here's a notional Armijo–Goldstein implementation. I can't test it without example data and a function, though.
import numpy as np

# both should be less than 1, but usually close to it
c = 0.8  # how much imperfection in function improvement we'll settle for
tau = 0.8  # factor by which the step is decreased at each iteration
x = np.array(f.optimal_range())  # assume everything is a vector; x is an n-dimensional coordinate
# NOTE: the part below is repeated for every x update
step = 0.3  # alpha in Armijo–Goldstein terms
# one partial derivative per coordinate; extend the list for more dimensions
gradient = np.array([f.fprime_x(x[0]), f.fprime_y(x[1])])
# in the simplest case (plain gradient descent) p points along the negative gradient,
# but in general they don't have to be the same, e.g. because of added momentum
p = -gradient / ((gradient**2).sum() ** 0.5)
m = gradient.dot(p)  # "happy case" improvement per unit step (negative for a descent direction)
t = -c * m  # improvement per unit step we'll consider good enough
# func(*x) might be worth precomputing
while func(*x) - func(*(x + step * p)) < step * t:  # not enough improvement yet, so shrink the step
    step *= tau
# good enough step size found: update x and repeat
x = x + step * p
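Since the snippet above can't be run without the asker's f, here is a minimal self-contained sketch of the same idea that does run. Everything in it is an assumption made for illustration: the test function func (a simple bowl with its minimum at (1, -2)), its hand-written gradient grad, the helper names backtracking_step and steepest_descent, and the values c = 0.5 and tau = 0.5. The step is accepted once func(point + step * direction) <= func(point) + c * step * slope, where slope is the dot product of the gradient and the direction. Both coordinates are updated together, and the loop stops when the gradient norm drops below a tolerance rather than testing the derivative with != 0, which floating-point values rarely satisfy.

```python
import numpy as np


def func(x, y):
    # Hypothetical test function (not from the question): minimum at (1, -2).
    return (x - 1.0) ** 2 + 2.0 * (y + 2.0) ** 2


def grad(x, y):
    # Analytic gradient of the test function above.
    return np.array([2.0 * (x - 1.0), 4.0 * (y + 2.0)])


def backtracking_step(point, direction, g, step=1.0, c=0.5, tau=0.5):
    """Shrink `step` until the Armijo sufficient-decrease condition holds."""
    f0 = func(*point)
    slope = g.dot(direction)  # directional derivative; negative for a descent direction
    # Keep shrinking while the new value is still above the required decrease line.
    while func(*(point + step * direction)) > f0 + c * step * slope:
        step *= tau
    return step


def steepest_descent(start, tol=1e-8, max_iter=1000):
    point = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        g = grad(*point)
        if np.linalg.norm(g) < tol:  # close enough to a stationary point
            break
        direction = -g  # steepest descent direction
        step = backtracking_step(point, direction, g)  # fresh line search every iteration
        point = point + step * direction  # update x and y together
    return point


minimum = steepest_descent([5.0, 5.0])
print(minimum)
```

Note that the step size is searched afresh from the same initial value at every iteration instead of being carried over and nudged up or down, which is what avoids the endless loop near the minimum.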