From this example you can see which assumptions are made when using SGD. The parameter α has to be chosen small enough that you are not jumping around wildly, but only move slightly towards the minimum of each batch. If you were to compute the corrections of all batches with respect to the same value of r, instead of updating r between batches, and then apply all corrections at once, you would actually be performing a step of (batch) gradient descent. If α is small, then r does not change significantly when the correction of a single batch is applied. Hence, computing all corrections at once with respect to the same r has almost the same effect as computing each correction right after the previous update, and so (batch) gradient descent and SGD converge to the same local minimum.
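Here is a minimal sketch (not from the original answer) of that comparison: with a small α, updating r after every mini-batch versus accumulating all mini-batch corrections at the same r gives nearly identical results. The quadratic loss, the data, and all names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                          # inputs
y = 3.0 * X + rng.normal(scale=0.1, size=100)     # targets with true slope 3

def grad(r, xb, yb):
    """Gradient of the mean squared error mean((r*x - y)^2) with respect to r."""
    return 2.0 * np.mean((r * xb - yb) * xb)

alpha = 0.01
batches = [slice(i, i + 10) for i in range(0, 100, 10)]

# SGD: update r after each mini-batch
r_sgd = 0.0
for b in batches:
    r_sgd -= alpha * grad(r_sgd, X[b], y[b])

# Batch-style step: compute every correction at the same r, then apply them all at once
r_batch = 0.0
total_correction = sum(alpha * grad(r_batch, X[b], y[b]) for b in batches)
r_batch -= total_correction

print(r_sgd, r_batch)   # nearly identical when alpha is small
```

For a large α the two results drift apart, because r then moves far enough between batches that the later SGD corrections are evaluated at a noticeably different point than the batch-style ones.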