Optimization

Computational Mathematics and Statistics Camp University of Chicago September 2018

Newton-Raphson root finding

  • Roots
  • Actual vs. approximate
  • Newton’s method

Taylor series expansion

Suppose \(f:\Re \rightarrow \Re\), \(f(x)\) is infinitely differentiable function. Then, the Taylor expansion of \(f(x)\) around \(a\) is given by

\[ \begin{eqnarray} f(x) & = & f(a) + \frac{f^{'}(a)}{1!} (x- a) + \frac{f^{''}(a)}{2!} (x - a)^2 + \frac{f^{'''}(a)}{3!}(x- a)^3 + \ldots \\ f(x) & = & \sum_{n=0}^{\infty } \frac{f^{n} (a) }{n!} (x - a)^n \end{eqnarray} \]

Root finding

  • If we want to find the root value \(x_1\) based on a starting point of \(x_0\)

    \[ \begin{eqnarray} 0 & \cong & f(x_0) + \frac{f^{'}(x_0)}{1} (x_1 - x_0) \end{eqnarray} \]

    \[ \begin{eqnarray} x_1 & \cong & x_0 - \frac{f(x_0)}{f'(x_0)} \end{eqnarray} \]

Demonstration of Newton’s method

\[y = x^3 + 2x^2 - 3x + 4\]

\[\frac{\partial y}{\partial x} = 3x^2 + 4x - 3\]

\[\frac{\partial^2 y}{\partial x^2} = 6x + 4\]

Demonstration of Newton’s method

Demonstration of Newton’s method

Implementing Newton-Raphson

\[x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}\]

\[x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}\]

\[x_{n+1} = x_n - \frac{3x_n^2 + 4x_n - 3}{6x_n + 4}\]

Implementing Newton-Raphson

  • Initial guess: \(x_0 = 10\)

    • \(f''(0.535) = 7.211\)
    • \(f'(0.535) = 4.6\times 10^{-12}\)
    • Local minimum
  • New guess: \(x_0 = -10\)

    • \(f''(-1.869) = -7.211\)
    • \(f'(-1.869) = 10^{-13}\)
    • Local maximum

Optimizing multivariate functions

  • Parameters \(\boldsymbol{\beta} = (\beta_{1}, \beta_{2}, \ldots, \beta_{n} )\) such that \(f(\boldsymbol{\beta}| \boldsymbol{X}, \boldsymbol{Y})\) is maximized
  • Policy \(\boldsymbol{x} \in \Re^{n}\) that maximizes \(U(\boldsymbol{x})\)
  • Weights \(\boldsymbol{\pi} = (\pi_{1}, \pi_{2}, \ldots, \pi_{K})\) such that a weighted average of forecasts \(\boldsymbol{f} = (f_{1} , f_{2}, \ldots, f_{k})\) have minimum loss

    \[ \begin{eqnarray} \min_{\boldsymbol{\pi}} & = & - (\sum_{j=1}^{K} \pi_{j} f_{j} - y ) ^ 2 \nonumber \end{eqnarray} \]

  • Analytic approach
  • Computational approach
    • Grid search
    • Multivariate Newton-Raphson
    • Stochastic gradient descent

Differences from single variable optimization procedure

  • Same basic approach, except we have multiple parameters of interest
  • Requires more explicit knowledge of linear algebra to track all the components and optimize over the multidimensional space

Differences from single variable optimization procedure

  • Let \(f:X \rightarrow \Re\) with \(X \subset \Re^{n}\). A vector \(\boldsymbol{x}^{*} \in X\) is a global maximum if , for all other \(\boldsymbol{x} \in X\)

    \[ \begin{eqnarray} f(\boldsymbol{x}^{*}) & > & f(\boldsymbol{x} ) \nonumber \end{eqnarray} \]

  • A vector \(\boldsymbol{x}^{\text{local}}\) is a local maximum if there is a neighborhood around \(\boldsymbol{x}^{\text{local}}\), \(Q \subset X\) such that, for all \(x \in Q\),

    \[ \begin{eqnarray} f(\boldsymbol{x}^{\text{local} }) & > & f(\boldsymbol{x} ) \end{eqnarray} \]

First derivative test: Gradient

  • Suppose \(f:X \rightarrow \Re^{n}\) with \(X \subset \Re^{1}\) is a differentiable function. Define the gradient vector of \(f\) at \(\boldsymbol{x}_{0}\), \(\nabla f(\boldsymbol{x}_{0})\) as

    \[ \begin{eqnarray} \nabla f (\boldsymbol{x}_{0}) & = & \left(\frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{1} }, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{2} }, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{3} }, \ldots, \frac{\partial f (\boldsymbol{x}_{0}) }{\partial x_{n} } \right) \end{eqnarray} \]

  • It is the first partial derivatives for each variable \(x_n\) stored in a vector. So if \(\boldsymbol{a} \in X\) is a local extremum, then,

    \[ \begin{eqnarray} \nabla f(\boldsymbol{a}) & = & \boldsymbol{0} \\ & = & (0, 0, \ldots, 0) \end{eqnarray} \]

Example gradients

\[ \begin{eqnarray} f(x,y) &=& x^2+y^2 \\ \Delta f(x,y) &=& (2x, \, 2y) \end{eqnarray} \]

\[ \begin{eqnarray} f(x,y) &=& x^3 y^4 +e^x -\log y \\ \Delta f(x,y) &=& (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \end{eqnarray} \]

Critical values

  1. Maximum
  2. Minimum
  3. Saddle point

Second derivative test: Hessian

  • Suppose \(f:X \rightarrow \Re^{1}\) , \(X \subset \Re^{n}\), with \(f\) a twice differentiable function. We will define the Hessian matrix as the matrix of second derivatives at \(\boldsymbol{x}^{*} \in X\),

    \[ \begin{eqnarray} \boldsymbol{H}(f)(\boldsymbol{x}^{*} ) & = & \begin{pmatrix} \frac{\partial^{2} f }{\partial x_{1} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{1} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{1} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \frac{\partial^{2} f }{\partial x_{2} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{2} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{2} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f }{\partial x_{n} \partial x_{1} } (\boldsymbol{x}^{*} ) & \frac{\partial^{2} f }{\partial x_{n} \partial x_{2} } (\boldsymbol{x}^{*} ) & \ldots & \frac{\partial^{2} f }{\partial x_{n} \partial x_{n} } (\boldsymbol{x}^{*} ) \\ \end{pmatrix} \nonumber \end{eqnarray} \]

  • Requires differentiating on the entire gradient with respect to each \(x_n\)

Example Hessians

\[ \begin{eqnarray} f(x,y) &=& x^2+y^2 \\ \Delta f(x,y) &=& (2x, \, 2y) \\ \boldsymbol{H}(f)(x,y) &=& \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \end{eqnarray} \]

\[ \begin{eqnarray} f(x,y) &=& x^3 y^4 +e^x -\log y \\ \Delta f(x,y) &=& (3x^2 y^4 + e^x, \, 4x^3y^3 - \frac{1}{y}) \\ \boldsymbol{H}(f)(x,y) &=& \begin{pmatrix} 6xy^4 + e^x & 12x^2y^3 \\ 12x^2y^3 & 12x^3y^2 + \frac{1}{y^2} \end{pmatrix} \end{eqnarray} \]

Definiteness of a matrix

  • Consider \(n \times n\) matrix \(\boldsymbol{A}\). If, for all \(\boldsymbol{x} \in \Re^{n}\) where \(\boldsymbol{x} \neq 0\):

    \[ \begin{eqnarray} \boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} & > & 0, \quad \text{ $\boldsymbol{A}$ is positive definite } \\ \boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} & < & 0, \quad \text{ $\boldsymbol{A}$ is negative definite } \end{eqnarray} \]

  • If \(\boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x} >0\) for some \(\boldsymbol{x}\) and \(\boldsymbol{x}^{'} \boldsymbol{A} \boldsymbol{x}<0\) for other \(\boldsymbol{x}\), then we say \(\boldsymbol{A}\) is indefinite.

Second derivative test

  • If \(\boldsymbol{H}(f)(\boldsymbol{a})\) is positive definite then \(\boldsymbol{a}\) is a local minimum
  • If \(\boldsymbol{H}(f)(\boldsymbol{a})\) is then \(\boldsymbol{a}\) is a local maximum
  • If \(\boldsymbol{H}(f)(\boldsymbol{a})\) is then \(\boldsymbol{a}\) is a saddle point

Use the determinant to assess definiteness

\[ \begin{eqnarray} \boldsymbol{H}(f)(\boldsymbol{a}) & = & \begin{pmatrix} A & B \\ B & C \\ \end{pmatrix} \end{eqnarray} \]

  • \(AC - B^2> 0\) and \(A>0\) \(\leadsto\) positive definite \(\leadsto\) \(\boldsymbol{a}\) is a local minimum
  • \(AC - B^2> 0\) and \(A<0\) \(\leadsto\) negative definite \(\leadsto\) \(\boldsymbol{a}\) is a local maximum
  • \(AC - B^2<0\) \(\leadsto\) indefinite \(\leadsto\) saddle point
  • \(AC- B^2 = 0\) inconclusive

Basic procedure summarized

  1. Calculate gradient
  2. Set equal to zero, solve system of equations
  3. Calculate Hessian
  4. Assess Hessian at critical values
  5. Boundary values? (if relevant)

A simple optimization example

  • Suppose \(f:\Re^{2} \rightarrow \Re\) with

    \[ \begin{eqnarray} f(x_{1}, x_{2}) & = & 3(x_1 + 2)^2 + 4(x_{2} + 4)^2 \nonumber \end{eqnarray} \]

  • Calculate gradient:

    \[ \begin{eqnarray} \nabla f(\boldsymbol{x}) & = & (6 x_{1} + 12 , 8x_{2} + 32 ) \nonumber \\ \boldsymbol{0} & = & (6 x_{1}^{*} + 12 , 8x_{2}^{*} + 32 ) \nonumber \end{eqnarray} \]

  • We now solve the system of equations to yield

    \[x_{1}^{*} = - 2, \quad x_{2}^{*} = -4\]

    \[ \begin{eqnarray} \textbf{H}(f)(\boldsymbol{x}^{*}) & = & \begin{pmatrix} 6 & 0 \\ 0 & 8 \\ \end{pmatrix}\nonumber \end{eqnarray} \]

    det\((\textbf{H}(f)(\boldsymbol{x}^{*}))\) = 48 and \(6>0\) so \(\textbf{H}(f)(\boldsymbol{x}^{*})\) is positive definite. \(\boldsymbol{x^{*}}\) is a local minimum.

Maximum likelihood estimation for a normal distribution

Suppose that we draw an independent and identically distributed random sample of \(n\) observations from a normal distribution,

\[ \begin{eqnarray} Y_{i} & \sim & \text{Normal}(\mu, \sigma^2) \\ \boldsymbol{Y} & = & (Y_{1}, Y_{2}, \ldots, Y_{n} ) \end{eqnarray} \]

Our task:

  • Obtain likelihood (summary estimator)
  • Derive maximum likelihood estimators for \(\mu\) and \(\sigma^2\)

Identify likelihood function

\[ \begin{eqnarray} L(\mu, \sigma^2 | \boldsymbol{Y} ) & \propto & \prod_{i=1}^{n} f(Y_{i}|\mu, \sigma^2) \\ &\propto & \prod_{i=1}^{N} \frac{\exp[ - \frac{ (Y_{i} - \mu)^2 }{2\sigma^2} ]}{\sqrt{2 \pi \sigma^2}} \\ & \propto & \frac{\exp[ -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} ]}{ (2\pi)^{n/2} \sigma^{2n/2} } \end{eqnarray} \]

Log-likelihood function

\[ \begin{eqnarray} l(\mu, \sigma^2|\boldsymbol{Y} ) & = & -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} log(2 \pi) - \frac{n}{2} \log (\sigma^2) \end{eqnarray} \]

Maximize log-likelihood function

\[ \begin{eqnarray} l(\mu, \sigma^2|\boldsymbol{Y} ) & = & -\sum_{i=1}^{n} \frac{(Y_{i} - \mu)^2}{2\sigma^2} - \frac{n}{2} \log (\sigma^2) \\ \frac{\partial l(\mu, \sigma^2)|\boldsymbol{Y} )}{\partial \mu } & = & \sum_{i=1}^{n} \frac{2(Y_{i} - \mu)}{2\sigma^2} \\ \frac{\partial l(\mu, \sigma^2)|\boldsymbol{Y})}{\partial \sigma^2} & = & -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (Y_{i} - \mu)^2 \end{eqnarray} \]

\[ \begin{eqnarray} 0 & = & -\sum_{i=1}^{n} \frac{2(Y_{i} - \widehat{\mu})}{2\widehat{\sigma}^2} \\ 0 & = & -\frac{n}{2\widehat{\sigma}^2 } + \frac{1}{2\widehat{\sigma}^4} \sum_{i=1}^{n} (Y_{i} - \mu^{*})^2 \end{eqnarray} \]

Solve for \(\widehat{\mu}\) and \(\widehat{\sigma}^2\)

\[ \begin{eqnarray} \widehat{\mu} & = & \frac{\sum_{i=1}^{n} Y_{i} }{n} \\ \widehat{\sigma}^{2} & = & \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - \overline{Y})^2 \end{eqnarray} \]

Hessian

\[ \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) = \begin{pmatrix} \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \mu^{2}} & \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \sigma^{2} \partial \mu} \\ \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial \sigma^{2} \partial \mu} & \frac{\partial^{2} l(\mu, \sigma^2|\boldsymbol{Y} )}{\partial^{2} \sigma^{2}} \end{pmatrix} \]

\[ \begin{eqnarray} \textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2) & = & \begin{pmatrix} \frac{-n}{\widehat{\sigma}^2} & 0 \\ 0 & \frac{-n}{2(\widehat{\sigma}^2)^2} \nonumber \\ \end{pmatrix} \end{eqnarray} \]

  • \(\text{det}(\textbf{H}(f)(\widehat{\mu}, \widehat{\sigma}^2)) = \dfrac{n^2}{2(\widehat{\sigma}^2)^3} > 0\) and \(A = \dfrac{-n}{\widehat{\sigma}^2} < 0\) \(\leadsto\) maximum

Computational optimization procedures

  • Analytical approaches can be difficult/impossible
  • Computational approaches simplify the problem
  • Different algorithms available with benefits/drawbacks
    • Grid search - tedious
    • Newton-Raphson - expensive
    • Gradient descent - parallelizable

Grid search

  • In R, drew 10,000 realizations from \(Y_{i} \sim \text{Normal}(0.25, 100)\)
  • Used realized values \(y_{i}\) to evaluate \(l(\mu, \sigma^2| \boldsymbol{y} )\) across a range of values
  • Computationally inefficient

Multivariate Newton-Raphson

Suppose \(f:\Re^{n} \rightarrow \Re\). Suppose we have guess \(\boldsymbol{x}_{t}\). Then our update is:

\[ \begin{eqnarray} \boldsymbol{x}_{t+1} & = & \boldsymbol{x}_{t} - \textbf{H}(f)(\boldsymbol{x}_{t})^{-1} \nabla f(\boldsymbol{x}_{t}) \end{eqnarray} \]

  • Approximate function with tangent plane
  • Find value of \(x_{t+1}\) that makes the plane equal to zero
  • Update again
  • Expensive to calculate (requires inverting Hessian)
  • Very sensitive to starting points

Gradient-based optimization

output = relu(dot(W, input) + B)
  • input
  • dot()
  • relu()
  • W, b

  • Adjust the weights to produce the lowest loss score
  • Initialize the weights with random values that are not useful

Training loop

  1. Draw a batch of training samples x and targets y
  2. Forward pass - run the network on x to obtain predictions y_pred
  3. Compute the loss of the network on the batch (mismatch between y_pred and y)
  4. Update the weights of the network in a way that slightly reduces the loss on this batch

Naive solution

  • Freeze all weights except one scalar coefficient and try different values for that coefficient
  • Horribly inefficient
  • Use differentiation and compute the gradient of the loss with regard to the network’s coefficients
  • Move the coefficients in the opposite direction from the gradient, thus decreasing the loss

Derivative

Derivative of a tensor operation

y_pred = dot(W, x)
loss_value = loss(y_pred, y)
loss_value = f(W)
loss_value = f(W0)
f'(W0)

Stochastic gradient descent

  1. Draw a batch of training samples x and targets y
  2. Forward pass - run the network on x to obtain predictions y_pred
  3. Compute the loss of the network on the batch (mismatch between y_pred and y)
  4. Backward pass - compute the gradient of the loss with regard to the network’s parameters
  5. Move the parameters a little in the opposite direction from the gradient to reduce the loss for the batch

Stochastic gradient descent

Stochastic gradient descent

Momentum