DOI: 10.17587/it.25.662-669

M. D. Ershov, Ph. D. Student, e-mail: ershov.m.d@rsreu.ru, Ryazan State Radio Engineering University, Ryazan, 390005, Russian Federation

First-Order Optimization Methods in Machine Learning

The problems arising in the training of multilayer feedforward neural networks due to the disadvantages of the gradient descent method are considered. A review is given of first-order optimization methods, covering both those widely used in machine learning and less well-known ones. The review includes a brief description of one of the training methods for neural networks: the backpropagation method (also known as the backward propagation of errors). A separate section is devoted to the gradient descent optimization method and to the convergence problems that arise when the backpropagation method is used with gradient descent. The review considers the following first-order optimization methods with an adaptive learning rate: gradient descent with momentum, the Nesterov accelerated gradient method (NAG), AdaGrad, RMSprop, AdaDelta, Adam, AdaMax, Nadam, AMSGrad, ND-Adam, NosAdam, Padam, and Yogi. The features of each method and the problems of their practical use are described. It can be noted that gradient descent, momentum, and NAG form the basis for AdaGrad, Adam, and the other methods used in machine learning. In these methods, the learning rate is adjusted for each parameter separately at each iteration of neural network training. Later works describe a deterioration of convergence and generalization ability associated with the use of the exponential moving average (a short-term memory of gradients). Methods such as AMSGrad, NosAdam, and Padam are aimed at solving this problem and combine the advantages of both Adam and stochastic gradient descent.

P. 662–669
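To make the contrast between the update rules concrete, the following minimal NumPy sketch illustrates plain gradient descent, momentum, Adam, and AMSGrad as they are usually formulated in the literature; it is not taken from the paper, and the function names and default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain gradient descent: one global learning rate for all parameters.
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Gradient descent with momentum: v accumulates an exponentially decaying
    # sum of past gradients, smoothing the update direction.
    v = beta * v + lr * grad
    return w - v, v

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and of its square (v)
    # yield a separately scaled step size for every parameter.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction of the first moment
    v_hat = v / (1 - beta2**t)   # bias correction of the second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def amsgrad_step(w, grad, m, v, v_max, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # AMSGrad: keeps the running maximum of the second-moment estimate, so the
    # effective per-parameter learning rate never grows; this counteracts the
    # short-term memory of the plain exponential moving average.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)
    return w - lr * m / (np.sqrt(v_max) + eps), m, v, v_max

# Example: minimizing f(w) = ||w||^2 with Adam.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):
    grad = 2 * w                 # gradient of ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
```

The division by the square root of the second-moment estimate is what gives the adaptive methods their per-parameter learning rate, and the exponential moving average in v is the "short-term memory of gradients" mentioned in the abstract.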