This is a work-in-progress page of Lonnie and Tom Keelin. The goal here is to explore the gradients of the Keelin distribution and fast algorithms for maximum likelihood fitting of the Keelin distribution. As a work in progress, the results on this page may very well have mistakes. What follows is actually the gradient of the log likelihood for any linear quantile-parameterized function, Keelin being a special case, assuming I haven't made a mistake.

## Notation

• $x_i$ = data points, $i=1..m$
• $\vec{a}$ = quantile coefficients, $[a_1, ..., a_k, ..., a_N]$
• $\vec{B}(y)$ = Basis at scalar y, where $0<y<1$ is a probability
$= [B_1(y), B_2(y), ..., B_k(y),...B_N(y)]$
• $\vec{b}(y)$ = the derivative basis, $[b_1(y), b_2(y), ..., b_k(y),...b_N(y)]$
$b_k(y) = {{\partial B_k(y)}\over{\partial y}}$
• $\vec{b'}(y)$ = the second derivative basis, $[b'_1(y), b'_2(y), ..., b'_k(y),...b'_N(y)]$
$b'_k(y) = {{\partial b_k(y)}\over{\partial y}}$
• $M(y) = \vec{a} \cdot \vec{B}(y)$ = y-th quantile
• $m^{-1}(y) = {{\partial M(y)}\over {\partial y}} = \vec{a} \cdot \vec{b}(y)$ = inverse density
• $m(y)$ = probability density at the yth quantile
• $s(y) = \vec{a} \cdot \vec{b'}(y)$
• $s'(y) = \vec{a} \cdot\vec{b''}(y)$
• $y(x_i)$ = the cumulative probability at $x_i$, given an implied $\vec{a}$, not shown explicitly.
• $\nabla_a f = \langle {{\partial f}\over{\partial a_1}},..,{{\partial f}\over{\partial a_k}},..,{{\partial f}\over{\partial a_N}} \rangle$ = the gradient of a function $f$ wrt $\vec{a}$
• $LL$ = The log likelihood of the data -- it's a function of $\vec{a}$ and the data $[x_1,...,x_m]$.
= $\ln \left( \prod_i p( x_i | \vec{a} ) \right) = \sum_i \ln p(x_i|\vec{a})$
• $p( x_i | \vec{a} ) = m( y(x_i) )$ = The probability density of the point $x_i$

All vectors written with an explicit arrow, e.g., $\vec{a}$, are indexed by $k$. All gradients with respect to $\vec{a}$ are also indexed by $k$, since $\vec{a}$ is. A dot product is shown explicitly, e.g., $\vec{a} \cdot \vec{b}(y)$, and its result is not indexed by $k$. Component-wise multiplication is shown without a dot, e.g., $\vec{a} \vec{b}(y)$, and its result is indexed by $k$.

## The Keelin basis

The derivation and algorithm that follow do not rely on any specific form of the basis. I'll just mention that for the Keelin distribution, the basis is

$B_k(y) = \left\{ \begin{array}{ll} (y-0.5)^{\lfloor{k/2}\rfloor} & k=0, 3, \mbox{even } k\ge 4 \\ (y-0.5)^{\lfloor{k/2}\rfloor} logit(y) & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right.$

The derivative basis is

$b_k(y) = \left\{ \begin{array}{ll} {\lfloor{k/2}\rfloor}\left(y-\dfrac{1}{2}\right)^{{\lfloor{k/2}\rfloor}-1} & k=0, 3, \mbox{even } k\ge 4 \\ \dfrac{\left(y-\frac{1}{2}\right)^{\lfloor{k/2}\rfloor}\left(\left(2{\lfloor{k/2}\rfloor}y^2-2{\lfloor{k/2}\rfloor}y\right) logit(y)-2y+1\right)}{\left(y-1\right)y\left(2y-1\right)} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right.$

The second derivative basis is

$b'_k(y) = \left\{ \begin{array}{ll} \left({\lfloor{k/2}\rfloor}-1\right) {\lfloor{k/2}\rfloor} \left(y-\dfrac{1}{2}\right)^{ {\lfloor{k/2}\rfloor} -2} & k=0, 3, \mbox{even } k\ge 4 \\ \dfrac{\left(y-\frac{1}{2}\right)^{\lfloor{k/2}\rfloor} \left[ g_1(y,{\lfloor{k/2}\rfloor}) logit(y)+g_2(y,{\lfloor{k/2}\rfloor})\right]}{\left(y-1\right)^2y^2\left(2y-1\right)^2} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right.$

The third derivative basis (which appears in the Hessian) is

$b''_k(y) = \left\{ \begin{array}{ll} \left({\lfloor{k/2}\rfloor}-2\right) \left({\lfloor{k/2}\rfloor}-1\right) {\lfloor{k/2}\rfloor} \left(y-\dfrac{1}{2}\right)^{ {\lfloor{k/2}\rfloor} -3} & k=0, 3, \mbox{even } k\ge 4 \\ {{2\left(y-\dfrac{1}{2}\right)^{\lfloor{k/2}\rfloor} \left[ g_3\left(y,{\lfloor{k/2}\rfloor}\right) logit(y) + g_4(y,{\lfloor{k/2}\rfloor}) \right] } \over {(y-1)^3 y^3 (2y-1)^3}} & k=1, 2, \mbox{odd } k\ge 5 \end{array}\right.$

where

$g_1(y,c) = \left(4 c^2 - 4 c \right)y^4+\left(8 c - 8 c^2\right)y^3+\left(4 c^2-4 c\right)y^2$
$g_2(y,c) = \left(8-8 c\right)y^3+\left(12 c-12\right)y^2+\left(6-4 c\right)y-1$
$g_3(y,c) = (4 c^3 - 12 c^2 + 8 c)(y^6 - 3 y^5 + 3 y^4 - y^3)$
$g_4(y,c) = (-12 c^2 + 36 c - 24)y^5 + (30c^2 - 90c + 60)y^4 + (-24c^2 + 78 c - 62) y^3 + (6 c^2 - 27c + 33) y^2 + (3c-9) y + 1$
$\lfloor{k/2}\rfloor$ is the floor of $k/2$ -- the largest integer that is less than or equal to $k/2$.
$logit(y) = \ln\left( {y\over{1-y}} \right)$
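
As a sanity check on the formulas above, here is a direct NumPy transcription of $B_k(y)$ and $b_k(y)$ (a sketch, not the Analytica implementation; the function names `keelin_B`, `keelin_b`, and `quantile` are mine; for the logit terms, $b_k$ uses the product-rule form $c\,(y-\frac{1}{2})^{c-1} logit(y) + (y-\frac{1}{2})^c/(y(1-y))$, which is algebraically equivalent to the closed form above):

```python
import numpy as np

def logit(y):
    return np.log(y / (1.0 - y))

def is_pure(k):
    """True for the purely polynomial terms: k = 0, 3, and even k >= 4."""
    return k in (0, 3) or (k >= 4 and k % 2 == 0)

def keelin_B(y, N):
    """B_k(y) for k = 0..N-1, rows indexed by k."""
    y = np.asarray(y, dtype=float)
    rows = []
    for k in range(N):
        c = k // 2                      # floor(k/2)
        p = (y - 0.5) ** c
        rows.append(p if is_pure(k) else p * logit(y))
    return np.stack(rows)

def keelin_b(y, N):
    """b_k(y) = dB_k/dy, via the product rule (equivalent to the closed form)."""
    y = np.asarray(y, dtype=float)
    rows = []
    for k in range(N):
        c = k // 2
        dp = c * (y - 0.5) ** (c - 1) if c > 0 else np.zeros_like(y)
        if is_pure(k):
            rows.append(dp)
        else:
            rows.append(dp * logit(y) + (y - 0.5) ** c / (y * (1.0 - y)))
    return np.stack(rows)

def quantile(a, y):
    """M(y) = a . B(y)."""
    return np.asarray(a) @ keelin_B(y, len(a))
```

Finite-differencing `keelin_B` reproduces `keelin_b`, which is an easy way to validate the closed-form derivatives.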

## Gradient Derivation

$\begin{split} \nabla_a LL &= \nabla_a \sum_i \ln p(x_i|\vec{a}) \\ &= \sum_i \nabla_a \ln m(y(x_i)) \\ &= \sum_i m^{-1}(y(x_i)) \nabla_a m(y(x_i)) \\ \end{split}$
$\begin{split} \nabla_a m(y(x_i)) &= \nabla_a \left(m^{-1}(y(x_i))\right)^{-1} \\ &= - \left( m^{-1}(y(x_i)) \right)^{-2} \nabla_a m^{-1}(y(x_i)) \\ &= - m^2(y(x_i)) \nabla_a \vec{a}\cdot \vec{b}(y(x_i)) \\ &= - m^2(y(x_i)) \left( \vec{b}(y(x_i)) + \left[\vec{a} \cdot \vec{b'}(y(x_i))\right] \nabla_a y(x_i) \right) \end{split}$
$\begin{split} \nabla_a LL &= -\sum_i m(y(x_i)) \left( \vec{b}(y(x_i)) + s(y(x_i)) \nabla_a y(x_i) \right) \end{split}$

where $s(y) = \vec{a} \cdot \vec{b'}(y)$.

To obtain $\nabla_a y(x_i)$, use

$\begin{split} x_i &= M(y_i) = \vec{a} \cdot \vec{B}(y_i) \\ 0 &= {\partial\over{\partial a_k}} \vec{a} \cdot \vec{B}(y_i) \\ &= B_k(y_i) + [\vec{a} \cdot \vec{b}(y_i)] {{\partial y_i}\over{\partial a_k}} \\ &= B_k(y_i) + m^{-1}(y_i) {{\partial y_i}\over{\partial a_k}} \\ {{\partial y_i}\over{\partial a_k}} &= - m(y_i) B_k(y_i) \\ \nabla_a y_i &= -m(y_i) \vec{B}(y_i) \end{split}$

So

$\nabla_a LL = -\sum_i m(y_i) \left( \vec{b}(y_i) - s(y_i) m(y_i) \vec{B}(y_i) \right)$

where $y_i = y(x_i)$ is found by solving $x_i = M(y_i)$ by Newton-Raphson.

### Old derivation of $\nabla_a y(x_i)$

(I'm keeping this around for a while for reference. Tom pointed out that the first term is zero, which led to a simpler derivation and a simpler form.)

The gradient $\nabla_a y(x_i)$ is the interesting part. $y(x_i)$ is not an easy function -- there is no analytical form for it. In Analytica 5, the CumKeelin and DensKeelin functions compute $y(x_i)$ using the Newton-Raphson method, which is fast, but not closed form by any means.

However, assume we have computed $\tilde{y_i} = y(x_i)$, for example by the Newton-Raphson method. Recall that the Newton-Raphson method uses the update rule

$z_{t+1} = z_t - {{f(z_t) - v}\over{f'(z_t)}}$

so that with each iteration it moves closer to the value of $z$ that makes $f(z)=v$.

So, suppose we change $a_k$ by an epsilon amount, ${\partial a_k}$, so that $\tilde{y_i}$ remains a close approximation to the new $y(x_i)$. Since the change is small, suppose we can obtain the new $y(x_i)$ by executing one iteration of the Newton-Raphson method starting at $\tilde{y_i}$.

$y(x_i) = \tilde{y_i} - {{(M(\tilde{y_i}) - x_i)}\over{m^{-1}(\tilde{y_i})}}$

The denominator follows from the fact that ${{\partial M(y)}\over{\partial y}} = m^{-1}(y)$. Using this,

$\begin{split} {{\partial y(x_i)}\over{\partial a_k}} &= - { { m^{-1}(\tilde{y_i}) {{\partial M(\tilde{y_i})}\over{\partial a_k}} - \left(M(\tilde{y_i})-x_i\right) {{\partial m^{-1}(\tilde{y_i})}\over{\partial a_k}}} \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= - { { m^{-1}(\tilde{y_i}) B_k(\tilde{y_i}) - \left(M(\tilde{y_i})-x_i\right) b_k(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ \nabla_a y(x_i) &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ m^{-1}(\tilde{y_i}) \right]^2 } }\\ &= { { \left(M(\tilde{y_i})-x_i\right) \vec{b}(\tilde{y_i}) - m^{-1}(\tilde{y_i}) \vec{B}(\tilde{y_i}) } \over { \left[ \vec{a} \cdot \vec{b}(\tilde{y_i}) \right]^2} } \end{split}$

Status: The gradient equations above now agree with finite differencing on $a_k$, so I'm pretty confident they are now right.
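
That finite-differencing check generalizes to any gradient; here is a minimal sketch (`ll` and `grad_ll` are hypothetical callables standing for $LL$ and the analytic $\nabla_a LL$, with the $y_i$ re-solved at each perturbed $\vec{a}$):

```python
import numpy as np

def max_grad_error(ll, grad_ll, a, eps=1e-6):
    """Largest absolute difference between grad_ll(a) and a central
    finite-difference estimate of the gradient of ll at a."""
    a = np.asarray(a, dtype=float)
    fd = np.empty_like(a)
    for k in range(a.size):
        e = np.zeros_like(a)
        e[k] = eps
        fd[k] = (ll(a + e) - ll(a - e)) / (2.0 * eps)
    return float(np.max(np.abs(fd - grad_ll(a))))
```

This is the kind of check the status note above describes.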

## Fitting algorithm

1. Given data $x_i$, obtain a starting $\vec{a}$ using ordinary least squares in quantile space.
2. Use Newton-Raphson to compute $y_i = y(x_i)$ for each datum, given the current $\vec{a}$.
3. Compute $\nabla_a LL$
4. Take a step in the direction of the gradient, e.g.,
$\vec{a}_{new} = \vec{a} + \epsilon \nabla_a LL$
where $\epsilon$ is the learning rate
5. If converged, stop
6. Go to Step 2, but use the values of $y_i$ as the starting point for the next Newton-Raphson iteration.
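
The loop above can be sketched end-to-end for the smallest basis, $N=2$, where $M(y) = a_0 + a_1 logit(y)$ (a logistic distribution). This is a minimal illustration, not the production algorithm: the basis derivatives `B`, `b`, `bp` are hard-coded for $N=2$, step 1's least-squares initialization is replaced by a caller-supplied starting $\vec{a}$, and the convergence test of step 5 is replaced by a fixed iteration count:

```python
import numpy as np

def logit(y):
    return np.log(y / (1.0 - y))

# Basis and its first two derivatives for N = 2: B(y) = [1, logit(y)]
def B(y):  return np.stack([np.ones_like(y), logit(y)])
def b(y):  return np.stack([np.zeros_like(y), 1.0 / (y * (1.0 - y))])
def bp(y): return np.stack([np.zeros_like(y), (2.0 * y - 1.0) / (y * (1.0 - y)) ** 2])

def fit(x, a, lr=1e-5, iters=100):
    """Gradient ascent on the log likelihood (steps 2-6 above)."""
    y = np.full_like(x, 0.5)
    for _ in range(iters):
        # Step 2: Newton-Raphson for y_i = y(x_i), warm-started at the previous y
        for _ in range(40):
            y = np.clip(y - (a @ B(y) - x) / (a @ b(y)), 1e-9, 1.0 - 1e-9)
        m = 1.0 / (a @ b(y))   # density m(y_i)
        s = a @ bp(y)          # s(y_i) = a . b'(y_i)
        # Step 3: gradient of LL
        grad = -np.sum(m * (b(y) - s * m * B(y)), axis=1)
        # Step 4: a step in the gradient direction
        a = a + lr * grad
    return a
```

Per step 6, the $y_i$ from the previous pass warm-start each Newton solve.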

## Hessian Derivation

(Derived by Tom)

With the Hessian, it should be possible to use multidimensional Newton-Raphson (i.e., find the zero of the gradient) to compute $\vec{a}$ quickly.
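
The update itself is standard; a sketch (with `grad` and `hess` standing for $\nabla_a LL$ and $\nabla^2_a LL$; the optional damping term is my addition, in case the Hessian is ill-conditioned):

```python
import numpy as np

def newton_step(a, grad, hess, damping=0.0):
    """One multidimensional Newton-Raphson step toward a zero of the gradient.

    Solves (H + damping*I) d = -grad rather than inverting H explicitly."""
    H = hess + damping * np.eye(len(a))
    return a + np.linalg.solve(H, -grad)
```

On a quadratic log likelihood, a single undamped step lands exactly on the optimum, which is why Newton-Raphson converges so much faster than gradient ascent near the solution.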

$\nabla^2_a LL = \left[ {{\partial^2 LL} \over {\partial a_j \partial a_k}} \right]$

The next line follows from differentiating $m(y) = \left(\vec{a}\cdot\vec{b}(y)\right)^{-1}$:

${{\partial m}\over{\partial a_k}} = -m(y)^2 (- s(y) m(y) B_k + b_k)$

For conciseness, I'll write $y$ for $y_i = y(x_i)$. Let $t_i$ denote the $j$th component of the $i$th data point's term in $\nabla_a LL$, so that ${{\partial LL}\over{\partial a_j}} = -\sum_i t_i$:

$t_i = m(y) b_j(y) - \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) B_j(y) = m(y) b_j(y) - s(y) m^2(y) B_j(y)$

So

${{\partial^2 LL} \over {\partial a_j \partial a_k}} = -{\partial \over {\partial a_k}} \sum_i t_i = -\sum_i {{\partial t_i}\over{\partial a_k}}$

The derivation is:

${{\partial t_i}\over{\partial a_k}} = {{\partial m}\over{\partial a_k}} b_j + m(y) b'_j(y) {{\partial y}\over{\partial a_k}} - s(y) m(y)^2 b_j(y) {{\partial y}\over{\partial a_k}} - 2 s(y) m(y) {{\partial m}\over{\partial a_k}} B_j(y) - {{\partial s(y)}\over{\partial a_k}} m(y)^2 B_j(y)$

A previous derivation is as follows (Note: After correcting a mistake in sign, I believe this is identical to the preceding one):

$\begin{split} {{\partial t_i}\over{\partial a_k}} &= m(y) b'_j(y) {{\partial y}\over{\partial a_k}} &+ m'(y) b_j(y) {{\partial y}\over{\partial a_k}} \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) b_j(y) {{\partial y}\over{\partial a_k}} \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) 2 m(y) m'(y) {{\partial y}\over{\partial a_k}} B_j(y) \\ &&- \left( \vec{a}\cdot \vec{b''}(y) {{\partial y}\over{\partial a_k}} + b'_k(y) \right) m^2(y) B_j(y) \end{split}$
$\begin{split} {{\partial t_i}\over{\partial a_k}} &= {{\partial y}\over{\partial a_k}} \left[ m(y) b'_j(y)\right. &+ m'(y) b_j(y) \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) m^2(y) b_j(y) \\ &&- \left( \vec{a}\cdot \vec{b'}(y) \right) 2 m(y) m'(y) B_j(y) \\ &&- \left. \left( \vec{a}\cdot \vec{b''}(y) \right) m^2(y) B_j(y) \right] - b'_k(y) m^2(y) B_j(y) \end{split}$

Using ${{\partial y}\over{\partial a_k}} = -m(y) B_k(y)$, we have

$\begin{split} {{\partial^2 LL} \over {\partial a_j \partial a_k}} &= \sum_i \left( m(y) B_k(y) \left[ m(y) b'_j(y)\right.\right. &+ m'(y) b_j(y) \\ &&- s(y) m^2(y) b_j(y) \\ &&- 2 s(y) m(y) m'(y) B_j(y) \\ &&- \left.\left. s'(y) m^2(y) B_j(y) \right] + m^2(y) b'_k(y) B_j(y) \right) \end{split}$

where $s'(y) = \vec{a}\cdot \vec{b''}(y)$
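
As with the gradient, the Hessian expression can be validated against finite differences of the analytic gradient before trusting it; a generic sketch (`grad_ll` and `hess_ll` are hypothetical callables for $\nabla_a LL$ and $\nabla^2_a LL$):

```python
import numpy as np

def max_hess_error(grad_ll, hess_ll, a, eps=1e-6):
    """Largest absolute difference between hess_ll(a) and central finite
    differences of grad_ll, computed column by column."""
    a = np.asarray(a, dtype=float)
    n = a.size
    fd = np.empty((n, n))
    for k in range(n):
        e = np.zeros(n)
        e[k] = eps
        fd[:, k] = (grad_ll(a + e) - grad_ll(a - e)) / (2.0 * eps)
    return float(np.max(np.abs(fd - hess_ll(a))))
```

A quick symmetry check ($\partial^2 LL / \partial a_j \partial a_k = \partial^2 LL / \partial a_k \partial a_j$) is also a useful sanity test on any candidate Hessian formula.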

## Future work

• The Hessian makes it possible to use Newton-Raphson for the whole MaxLL fit (faster convergence than simple gradient descent).
• Can we solve this as a constrained optimization problem (LP) using both $\vec{a}$ and $y$ as decision variables, with the extra constraint $x = M(y)$?