# Contents

- Introduction
- Motivation
- Newton’s Method
- Steepest Descent Method
- Limitation
- Sources

# Introduction

In many sciences, the problem of non-linear optimization appears quite often. What happens if they’re non-linear? Sometimes, the system of equations remains easy to solve even given non-linearity, but other times, the possibility of solving the problem analytically may not exist if the algebra gets too complicated. The solution lies in using numerical methods (approximations). Here, I present two common methods: Newton’s Method, and the Steepest Descent Method.

Each method has its own limitations and the choice of which method to use depends on the problem. These methods involve using iterative computation in order to reach a solution. Luckily, computers can prove useful in providing solutions very quickly due to its quick processing power for numerical computation.

In order to understand some of the concepts and terms used in this article, the reader has knowledge in multivariable calculus. Also, equations in picture form made the line spacing look funny, so I’m going to do my best write the equations as text. I should also point out that I couldn’t write the vector arrow on top of the variables in text form so vectors are written in bold.

# Motivation

In calculus, the idea of optimization involves taking the derivative of the function and equating it to zero. In multivariable calculus, the same idea applies except, we set all the partial derivatives to zero and solve the system.

So in general, if we are given a function:

*ƒ( x)*, where

**x**= [x

_{1}, x

_{2}, … x

_{n}]

^{T}

The maximum or minimum can be found by taking a partial derivative with respect to the vector ** x**. This gives the following equations:

*∂ ƒ/dx_{1} = 0*

*∂ ƒ/dx_{2} = 0*

…

*∂ ƒ/dx_{n} = 0*

A short-hand way of expressing the above equations is by using the gradient vector, giving*∇ƒ( x) = 0*. By definition, the gradient is defined by:

*∇ƒ( x) = [∂ƒ/dx_{1}, ∂ƒ/dx_{2}, … , ∂ƒ/dx_{n}].*

An analytical solution is not always possible, hence the motivation for approximating the solution using numerical methods.

# Newton’s Method

In general, this method is written below. Explanation of each step will follow.

- Given the equations to solve, derive equations for the approximating line or plane, depending on whether the problem is 2D, 3D, nD, etc.
- Using the linearized equations, solve the system of equations
- Using the results from step 2, generate new equations for a new approximating line or plane, and repeat step 2.

This method also involves starting with a guess, denoted by a variable with a subscript of 0. A guess is required, because this is the starting point for this method. Using this guess, each successive solution *should* converge (more on that later).

## Step 1

In order to solve the equations *∇ƒ( x) = 0* (see Motivation for more information), we can apply multivariate Newton’s method to approximate the solution. Like the single-variate (two-dimensional) version, a linear approximation is required. However, instead of having an approximation in the form of a line, the approximation is a plane instead.

2D (line): *ƒ(x) ≈ ƒ(x _{0}) + ƒ’(x_{0})(x – x_{0})*

3D (plane): *ƒ(x, y) ≈ ƒ(x _{0}, y_{0}) + ƒ_{x}(x_{0}, y_{0})(x – x_{0}) + ƒ_{y}(x_{0}, y_{0})(y – y_{0})*

nD (nD plane): *ƒ( x) ≈ ƒ(x_{0}) + ∇ƒ(x_{0})(x – x_{0})*

Since we are trying to optimize, we have the partial derivatives of the function we are trying to optimize, but in order to apply Newton’s Method, we need a linear approximation of the partial derivatives. This means that in order to approximate the partial derivatives, we need to take another derivative in order to get the slopes of the approximating planes. If that description didn’t make sense, try the following. For an equation with two independent variables, there will be two equations for the partial derivatives, one for x and one for y. Taking partial derivatives from each of those equations again will give the slopes of the plane that approximate the first partial derivative. So if we have two independent variables, we end up getting four equations: ƒ_{xx}, ƒ_{xy},ƒ_{yy}, and ƒ_{yx} (ƒ_{xy }and ƒ_{yx} should be the same). In general, for n partial derivatives, we would have *n²* equations if we took another set of derivatives. The approximations for both of the partial derivatives would be:

*ƒ _{x}(x, y) ≈ ƒ_{x}(x_{0}, y_{0}) + ƒ_{xx}(x_{0}, y_{0})(x – x_{0}) + ƒ_{xy}(x_{0}, y_{0})(y – y_{0})*

*ƒ _{y}(x, y) ≈ ƒ_{y}(x_{0}, y_{0}) + ƒ_{yy}(x_{0}, y_{0})(x – x_{0}) + ƒ_{yx}(x_{0}, y_{0})(y – y_{0})*

Plug in the values for the second derivatives, and the values from the initial guess (the x y values and what function value you get using those values).

## Step 2

Although the equations generated from step 1 is yet another system of equations, the main difference is that the equations from step 1 are **linear**. This means that the system can be solved analytically. Solve for the solution of the system using row reduction methods, or other methods. The solution from solving the system will be the values used for the next iteration and will replace the guessed values. The solution may or may not be close to the actual solution. More iterations need to be done to see if the answer changes with each iteration. Usually if the guess is really far away, then it takes a lot of iterations to get it really close. However, sometimes, a bad guess may cause the solution to diverge, meaning that the solution will be further and further off.

## Step 3

Generate new equations for the approximating planes using the solution from step 2. Using these equations, go back to step 2 and solve it again. This method is super tedious if done on paper. Matlab can help, but then again, Matlab can do the entire approximation with just a few commands, but not everyone has access to it unfortunately.

Hopefully after a few iterations, the solution stops changing. When that happens, you can substitute the final solution back into the original equation to see if it works!

# Steepest Descent Method

Here is the general method:

- Find the direction of the maximum or minimum directional derivative
- Follow it until a maximum or minimum is reached

Newton’s Method is sufficient in a lot of cases, but sometimes it may be too hard to go beyond first derivatives. This main idea of this method is to use the first derivative and follow the direction in which the function gets larger/smaller. A fundamental concept used in this method is the directional derivative. Since in 3D (or higher dimensions), we are not restricted to movement along the x axis. This means we could move in any random direction, and in each direction the rate of change is going to be different. To calculate the max or min, we need to calculate where the directional derivative is largest or smallest. Just like a boulder rolling down the steepest slope on the hill.

Starting with a guess, we can find out the value of the function at the point and the direction to travel in. Take a step in that direction and see how the value of the function changed. Keep doing that until a max or min is reached.

## Step 1

The definition of the directional derivative is the dot product of the gradient and the unit vector specifying the direction. Here is the definition of the directional derivative:

*D _{u}ƒ = ∇ƒ(x)·u*

Since the definition involves a dot product, it is true that if the two vectors are orthogonal, the resulting dot product is zero. The direction that this derivative is in is called the *level direction*. But if the two vectors are parallel, we get the highest value of the directional derivative. For a vector that is parallel in the same direction, the directional derivative will be a maximum, but if the vector is antiparallel (parallel but both pointing opposite ways), the directional derivative will be a minimum.

Since the maximum value of the directional derivative is parallel to the gradient, we can calculate the unit vector of the gradient and use it as the unit vector ** u. **In order to get the minimum, simply flip the direction of the vector by multiplying it by –1.

**u = **∇ƒ(**x**)/|∇ƒ(**x**)|

Although not necessarily required, substituting this value into the definition of the directional derivative, and doing some simplification with algebra, the maximum and minimum directional derivative is given by:

*Max D _{u}ƒ = |∇ƒ(x)|*

*Min D _{u}ƒ = -|∇ƒ(x)|*

Using the initial guess, calculate the direction of to travel in at that point.

## Step 2

Knowing the direction to travel choose a desired step size “h”. This means that if the equation will be traversed in a particular direction, we need to know exactly *how* far to go. Choosing smaller step sizes gives better computational accuracy but increases computation time.

To get to the maximum, use the direction for the maximum directional derivative. To get the minimum, use the direction for the smallest directional derivative.

Starting with a guess, calculate the function value at that point. Then take a step in the direction given by Step 1 using the step size. The variable “u” here is a component of the unit vector

*x _{n+1} = x_{n} + u_{x}h*

y_{n+1} = y_{n} + u_{y}h

At this new value, calculate the function value again. Keep repeating this until the maximum or minimum is reached.

# Limitation

The approximation is generally only as good as the initial guess. Sometimes, a bad guess can lead to a solution that never converges. In that case, a different value must be used in order to find a converging solution. For example, if we have a fourth order polynomial function, a local extreme will exist. As x approaches –∞ or ∞, y will do the same. If I use the steepest descent and give a bad guess, my solution will go to ∞ instead of the local extreme.

# Sources

- Calculus by Gilbert Strang
- Applied Regression Analysis by Norman Draper and Harry Smith

## Mark Hagerman February 17, 2011

I’ve been playing with the steepest descent method for fitting a curve of the form: y = a + b*exp(cx) to a data set.

I select the step size (h) by computing the variance for the current parameters (a,b,c), and for the modified parameters at -hu and -2hu. Then I fit those values to a parabola (a la Simpson’s method), and use the results to calculate a revised step size.

It seems to work fairly well, though I’m still fine-tuning the code.

## hp February 19, 2011

I too found that the steepest descent method works well for fitting a curve since it’s hard take derivatives there are so many summations in the equations. I too programmed a way to do the steepest descent method, but unfortunately, trying to fit a circle using least squares was very difficult because a bad initial guess kept making my solution diverge.