Optimization

Computational Mathematics and Statistics Camp University of Chicago September 2018

Newton-Raphson root finding

Roots
Actual vs. approximate
Newton’s method

Taylor series expansion

Suppose $f:\Re \rightarrow \Re$, $f(x)$ is infinitely differentiable function. Then, the Taylor expansion of $f(x)$ around $a$ is given by

\[ \begin{eqnarray} f(x) & = & f(a) + \frac{f^{'}(a)}{1!} (x- a) + \frac{f^{''}(a)}{2!} (x - a)^2 + \frac{f^{'''}(a)}{3!}(x- a)^3 + \ldots \\ f(x) & = & \sum_{n=0}^{\infty } \frac{f^{n} (a) }{n!} (x - a)^n \end{eqnarray} \]

Root finding

If we want to find the root value $x_1$ based on a starting point of $x_0$

\[ \begin{eqnarray} 0 & \cong & f(x_0) + \frac{f^{'}(x_0)}{1} (x_1 - x_0) \end{eqnarray} \]

\[ \begin{eqnarray} x_1 & \cong & x_0 - \frac{f(x_0)}{f'(x_0)} \end{eqnarray} \]

Demonstration of Newton’s method

\[y = x^3 + 2x^2 - 3x + 4\]

\[\frac{\partial y}{\partial x} = 3x^2 + 4x - 3\]

\[\frac{\partial^2 y}{\partial x^2} = 6x + 4\]

Demonstration of Newton’s method

Implementing Newton-Raphson

\[x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}\]

\[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]

\[x_{n+1} = x_n - \frac{3x_n^2 + 4x_n - 3}{6x_n + 4}\]

Implementing Newton-Raphson

Initial guess: $x_0 = 10$
- $f''(0.535) = 7.211$
- $f'(0.535) = 4.6\times 10^{-12}$
- Local minimum
New guess: $x_0 = -10$
- $f''(-1.869) = -7.211$
- $f'(-1.869) = 10^{-13}$
- Local maximum

Optimizing multivariate functions

Parameters $\boldsymbol{\beta} = (\beta_{1}, \beta_{2}, \ldots, \beta_{n} )$ such that $f(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y})$ is maximized
Policy $\boldsymbol{x} \in \Re^{n}$ that maximizes $U(\boldsymbol{x})$
Weights $\boldsymbol{\pi} = (\pi_{1}, \pi_{2}, \ldots, \pi_{K})$ such that a weighted average of forecasts $\boldsymbol{f} = (f_{1} , f_{2}, \ldots, f_{k})$ have minimum loss

\[ \begin{eqnarray} \min_{\boldsymbol{\pi}} & = & - (\sum_{j=1}^{K} \pi_{j} f_{j} - y ) ^ 2 \nonumber \end{eqnarray} \]
Analytic approach
Computational approach
- Grid search
- Multivariate Newton-Raphson
- Stochastic gradient descent

Differences from single variable optimization procedure

Same basic approach, except we have multiple parameters of interest
Requires more explicit knowledge of linear algebra to track all the components and optimize over the multidimensional space

Differences from single variable optimization procedure

Let $f:X \rightarrow \Re$ with $X \subset \Re^{n}$. A vector $\boldsymbol{x}^{*} \in X$ is a global maximum if , for all other $\boldsymbol{x} \in X$

\[ \begin{eqnarray} f(\boldsymbol{x}^{*}) & > & f(\boldsymbol{x} ) \nonumber \end{eqnarray} \]
A vector $\boldsymbol{x}^{\text{local}}$ is a local maximum if there is a neighborhood around $\boldsymbol{x}^{\text{local}}$, $Q \subset X$ such that, for all $x \in Q$,

\[ \begin{eqnarray} f(\boldsymbol{x}^{\text{local} }) & > & f(\boldsymbol{x} ) \end{eqnarray} \]

First derivative test: Gradient

Suppose $f:X \rightarrow \Re^{n}$ with $X \subset \Re^{1}$ is a differentiable function. Define the gradient vector of $f$ at $\boldsymbol{x}_{0}$, $\nabla f(\boldsymbol{x}_{0})$ as

\[ \begin{eqnarray} \nabla f (\boldsymbol{x}_{0}) & = & \left(\frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{1} }, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{2} }, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{3} }, \ldots, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{n} } \right) \end{eqnarray} \]
It is the first partial derivatives for each variable $x_n$ stored in a vector. So if $\boldsymbol{a} \in X$ is a local extremum, then,

\[ \begin{eqnarray} \nabla f(\boldsymbol{a}) & = & \boldsymbol{0} \\ & = & (0, 0, \ldots, 0) \end{eqnarray} \]

Example gradients

\[ \begin{eqnarray} f(x,y) &=& x^2+y^2 \\ \Delta f(x,y) &=& (2x, \, 2y) \end{eqnarray} \]

\[ \begin{eqnarray} f(x,y) &=& x^3 y^4 +e^x -\log y \\ \Delta f(x,y) &=& (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \end{eqnarray} \]

Critical values

Maximum
Minimum
Saddle point

Second derivative test: Hessian

Suppose $f:X \rightarrow \Re^{1}$ , $X \subset \Re^{n}$, with $f$ a twice differentiable function. We will define the Hessian matrix as the matrix of second derivatives at $\boldsymbol{x}^{*} \in X$,

\[ \begin{eqnarray} \boldsymbol{H}(f)(\boldsymbol{x}^{*} ) & = & \begin{pmatrix} \frac{\partial^{2} f }{\partial x_{1} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{1} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{1} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \frac{\partial^{2} f }{\partial x_{2} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{2} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{2} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f }{\partial x_{n} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{n} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{n} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \end{pmatrix} \nonumber \end{eqnarray} \]
Requires differentiating on the entire gradient with respect to each $x_n$

Example Hessians

\[ \begin{eqnarray} f(x,y) &=& x^2+y^2 \\ \Delta f(x,y) &=& (2x, \, 2y) \\ \boldsymbol{H}(f)(x,y) &=& \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \end{eqnarray} \]

\[ \begin{eqnarray} f(x,y) &=& x^3 y^4 +e^x -\log y \\ \Delta f(x,y) &=& (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \\ \boldsymbol{H}(f)(x,y) &=& \begin{pmatrix} 6xy^4 + e^x & 12x^2y^3 \\ 12x^2y^3 & 12x^3y^2 + \frac{1}{y^2} \end{pmatrix} \end{eqnarray} \]

Definiteness of a matrix

Consider $n \times n$ matrix $\boldsymbol{A}$. If, for all $\boldsymbol{x} \in \Re^{n}$ where $\boldsymbol{x} \neq 0$:

\[ \begin{eqnarray} \boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} & > & 0, \quad \text{ $\boldsymbol{A}$ is positive definite } \\ \boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} & < & 0, \quad \text{ $\boldsymbol{A}$ is negative definite } \end{eqnarray} \]
If $\boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} >0$ for some $\boldsymbol{x}$ and $\boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x}<0$ for other $\boldsymbol{x}$, then we say $\boldsymbol{A}$ is indefinite.

Second derivative test

If $\boldsymbol{H}(f)(\boldsymbol{a})$ is positive definite then $\boldsymbol{a}$ is a local minimum
If $\boldsymbol{H}(f)(\boldsymbol{a})$ is then $\boldsymbol{a}$ is a local maximum
If $\boldsymbol{H}(f)(\boldsymbol{a})$ is then $\boldsymbol{a}$ is a saddle point

Use the determinant to assess definiteness

\[ \begin{eqnarray} \boldsymbol{H}(f)(\boldsymbol{a}) & = & \begin{pmatrix} A & B \\ B & C \\ \end{pmatrix} \end{eqnarray} \]

$AC - B^2> 0$ and $A>0$ $\leadsto$ positive definite $\leadsto$ $\boldsymbol{a}$ is a local minimum
$AC - B^2> 0$ and $A<0$ $\leadsto$ negative definite $\leadsto$ $\boldsymbol{a}$ is a local maximum
$AC - B^2<0$ $\leadsto$ indefinite $\leadsto$ saddle point
$AC- B^2 = 0$ inconclusive

Basic procedure summarized

Calculate gradient
Set equal to zero, solve system of equations
Calculate Hessian
Assess Hessian at critical values
Boundary values? (if relevant)

A simple optimization example

Suppose $f:\Re^{2} \rightarrow \Re$ with

\[ \begin{eqnarray} f(x_{1}, x_{2}) & = & 3(x_1 + 2)^2 + 4(x_{2} + 4)^2 \nonumber \end{eqnarray} \]
Calculate gradient:

\[ \begin{eqnarray} \nabla f(\boldsymbol{x}) & = & (6 x_{1} + 12 , 8x_{2} + 32 ) \nonumber \\ \boldsymbol{0} & = & (6 x_{1}^{*} + 12 , 8x_{2}^{*} + 32 ) \nonumber \end{eqnarray} \]
We now solve the system of equations to yield

\[x_{1}^{*} = - 2, \quad x_{2}^{*} = -4\]

\[ \begin{eqnarray} \textbf{H}(f)(\boldsymbol{x}^{*}) & = & \begin{pmatrix} 6 & 0 \\ 0 & 8 \\ \end{pmatrix}\nonumber \end{eqnarray} \]

det$(\textbf{H}(f)(\boldsymbol{x}^{*}))$ = 48 and $6>0$ so $\textbf{H}(f)(\boldsymbol{x}^{*})$ is positive definite. $\boldsymbol{x^{*}}$ is a local minimum.

Maximum likelihood estimation for a normal distribution

Suppose that we draw an independent and identically distributed random sample of $n$ observations from a normal distribution,

\[ \begin{eqnarray} Y_{i} & \sim & \text{Normal}(\mu, \sigma^2) \\ \boldsymbol{Y} & = & (Y_{1}, Y_{2}, \ldots, Y_{n} ) \end{eqnarray} \]

Our task:

Obtain likelihood (summary estimator)
Derive maximum likelihood estimators for $\mu$ and $\sigma^2$

Identify likelihood function

\[ \begin{eqnarray} L(\mu, \sigma^2 | \boldsymbol{Y} ) & \propto & \prod_{i=1}^{n} f(Y_{i}|\mu, \sigma^2) \\ &\propto & \prod_{i=1}^{N} \frac{\exp[ - \frac{ (Y_{i} - \mu)^2 }{2\sigma^2} ]}{\sqrt{2 \pi \sigma^2}} \\ & \propto & \frac{\exp[ -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} ]}{ (2\pi)^{n/2} \sigma^{2n/2} } \end{eqnarray} \]

Log-likelihood function

\[ \begin{eqnarray} l(\mu, \sigma^2|\boldsymbol{Y} ) & = & -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} log(2 \pi) - \frac{n}{2} \log (\sigma^2) \end{eqnarray} \]

Maximize log-likelihood function

\[ \begin{eqnarray} l(\mu, \sigma^2|\boldsymbol{Y} ) & = & -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} \log (\sigma^2) \\ \frac{\partial l(\mu, \sigma^2)|\boldsymbol{Y} )}{\partial \mu } & = & \sum_{i=1}^{n} \frac{2(Y_{i} - \mu)}{2\sigma^2} \\ \frac{\partial l(\mu, \sigma^2)|\boldsymbol{Y})}{\partial \sigma^2} & = & -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (Y_{i} - \mu)^2 \end{eqnarray} \]

\[ \begin{eqnarray} 0 & = & -\sum_{i=1}^{n} \frac{2(Y_{i} - \widehat{\mu})}{2\widehat{\sigma}^2} \\ 0 & = & -\frac{n}{2\widehat{\sigma}^2 } + \frac{1}{2\widehat{\sigma}^4} \sum_{i=1}^{n} (Y_{i} - \mu^{*})^2 \end{eqnarray} \]

Solve for $\widehat{\mu}$ and $\widehat{\sigma}^2$

\[ \begin{eqnarray} \widehat{\mu} & = & \frac{\sum_{i=1}^{n} Y_{i} }{n} \\ \widehat{\sigma}^{2} & = & \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \overline{Y})^2 \end{eqnarray} \]

Hessian

\[ \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) = \begin{pmatrix} \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \mu^{2}} & \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \sigma^{2} \partial \mu} \\ \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \sigma^{2} \partial \mu} & \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial^{2} \sigma^{2}} \end{pmatrix} \]

\[ \begin{eqnarray} \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) & = & \begin{pmatrix} \frac{-n}{\widehat{\sigma}^2} & 0 \\ 0 & \frac{-n}{2(\widehat{\sigma}^2)^2} \nonumber \\ \end{pmatrix} \end{eqnarray} \]

$\text{det}(\textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2)) = \dfrac{n^2}{2(\widehat{\sigma}^2)^3} > 0$ and $A = \dfrac{-n}{\widehat{\sigma}^2} < 0$ $\leadsto$ maximum

Computational optimization procedures

Analytical approaches can be difficult/impossible
Computational approaches simplify the problem
Different algorithms available with benefits/drawbacks
- Grid search - tedious
- Newton-Raphson - expensive
- Gradient descent - parallelizable

Grid search

In R, drew 10,000 realizations from $Y_{i} \sim \text{Normal}(0.25, 100)$
Used realized values $y_{i}$ to evaluate $l(\mu, \sigma^2| \boldsymbol{y} )$ across a range of values
Computationally inefficient

Multivariate Newton-Raphson

Suppose $f:\Re^{n} \rightarrow \Re$. Suppose we have guess $\boldsymbol{x}_{t}$. Then our update is:

\[ \begin{eqnarray} \boldsymbol{x}_{t+1} & = & \boldsymbol{x}_{t} - \textbf{H}(f)(\boldsymbol{x}_{t})^{-1} \nabla f(\boldsymbol{x}_{t}) \end{eqnarray} \]

Approximate function with tangent plane
Find value of $x_{t+1}$ that makes the plane equal to zero
Update again
Expensive to calculate (requires inverting Hessian)
Very sensitive to starting points

Gradient-based optimization

output = relu(dot(W, input) + B)

input
dot()
relu()
W, b
Adjust the weights to produce the lowest loss score
Initialize the weights with random values that are not useful

Training loop

Draw a batch of training samples x and targets y
Forward pass - run the network on x to obtain predictions y_pred
Compute the loss of the network on the batch (mismatch between y_pred and y)
Update the weights of the network in a way that slightly reduces the loss on this batch

Naive solution

Freeze all weights except one scalar coefficient and try different values for that coefficient
Horribly inefficient
Use differentiation and compute the gradient of the loss with regard to the network’s coefficients
Move the coefficients in the opposite direction from the gradient, thus decreasing the loss

Derivative

Derivative of a tensor operation

y_pred = dot(W, x)
loss_value = loss(y_pred, y)

loss_value = f(W)

loss_value = f(W0)
f'(W0)

Stochastic gradient descent

Draw a batch of training samples x and targets y
Forward pass - run the network on x to obtain predictions y_pred
Compute the loss of the network on the batch (mismatch between y_pred and y)
Backward pass - compute the gradient of the loss with regard to the network’s parameters
Move the parameters a little in the opposite direction from the gradient to reduce the loss for the batch

Optimization

Computational Mathematics and Statistics Camp University of Chicago September 2018

Newton-Raphson root finding

Taylor series expansion

Root finding

Demonstration of Newton’s method

Demonstration of Newton’s method

Demonstration of Newton’s method

Implementing Newton-Raphson

Implementing Newton-Raphson

Optimizing multivariate functions

Differences from single variable optimization procedure

Differences from single variable optimization procedure

First derivative test: Gradient

Example gradients

Critical values

Second derivative test: Hessian

Example Hessians

Definiteness of a matrix

Second derivative test

Use the determinant to assess definiteness

Basic procedure summarized

A simple optimization example

Maximum likelihood estimation for a normal distribution

Identify likelihood function

Log-likelihood function

Maximize log-likelihood function

Solve for \(\widehat{\mu}\) and \(\widehat{\sigma}^2\)

Hessian

Computational optimization procedures

Grid search

Grid search

Multivariate Newton-Raphson

Gradient-based optimization

Training loop

Naive solution

Derivative

Derivative of a tensor operation

Stochastic gradient descent

Stochastic gradient descent

Stochastic gradient descent

Momentum