Optimization in Machine Learning: Chapter 3 Projected Gradient Descent (2)



1. Smooth and strongly convex functions: \(O(\log(1/\epsilon))\) steps

\(\large\textbf{Theorem 3.5}\):
Let \(f:\mathbf{dom}(f) \rightarrow \mathbb{R}\) be convex and differentiable. Suppose \(f\) is smooth with parameter \(L\) and strongly convex with parameter \(\mu\). Choose the stepsize

\[\gamma = \frac{1}{L} \]

Then projected gradient descent from an arbitrary \(x_0\) satisfies the following two properties:

  • \(\text{(i) Squared distances to }x^*\text{ are geometrically decreasing:}\)

\[\begin{align} ||x_{t+1}-x^*||^2\leq (1-\frac{\mu}{L})||x_t-x^*||^2 \end{align} \]

\(\textbf{Proof:}\)
By the definition of projected gradient descent (writing \(g_t = \nabla f(x_t)\)):

\[\begin{align} y_{t+1} &= x_t-\gamma g_t\\ x_{t+1} &= \prod_X(y_{t+1}) = \arg\min_{x\in X}||x-y_{t+1}||^2 \end{align} \]

We also have the following projection inequality, valid for all \(x\in X\):

\[||x-\prod_X(y)||^2+||y-\prod_X(y)||^2\leq ||x-y||^2 \]

Therefore, using \(x_t-y_{t+1}=\gamma g_t\) and \(2v^Tw = ||v||^2+||w||^2-||v-w||^2\):

\[\begin{align} g_t^T(x_t-x^*) &= \frac{1}{\gamma}(x_t-y_{t+1})^T(x_t-x^*)\\ &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2] \end{align} \]

Applying the projection inequality with \(x = x^*\) and \(y = y_{t+1}\):

\[||x^*-x_{t+1}||^2+||y_{t+1}-x_{t+1}||^2\leq ||x^*-y_{t+1}||^2 \]

Plugging this into the identity for \(g_t^T(x_t-x^*)\) above:

\[\begin{align} g_t^T(x_t-x^*) &=\frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||y_{t+1}-x^*||^2]\\ &\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2] \end{align} \]

Then by strong convexity, for all \(x,y\in\mathbf{dom}(f)\):

\[f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}||x-y||^2 \]

Hence, applying this with \(x=x_t\) and \(y=x^*\):

\[\begin{align} f(x_t)-f(x^*)&\leq g_t^T(x_t-x^*)-\frac{\mu}{2}||x_t-x^*||^2\\ &\leq \frac{1}{2\gamma}[\gamma^2||g_t||^2+||x_t-x^*||^2-||x^*-x_{t+1}||^2-||y_{t+1}-x_{t+1}||^2]-\frac{\mu}{2}||x_t-x^*||^2 \end{align} \]

The term \(||x_{t+1}-x^*||^2\) appears here, so we rearrange to bound it:

\[||x_{t+1}-x^*||^2\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2 \]

\(\textbf{Recall Lemma 3.3 from the previous chapter:}\)

\[\begin{align} f(x_{t+1})\leq f(x_t)-\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2 \end{align} \]

Thus, since \(f(x^*)\leq f(x_{t+1})\):

\[\begin{align} f(x^*)-f(x_t)\leq f(x_{t+1})-f(x_t)\leq -\frac{1}{2L}||g_t||^2+\frac{L}{2}||y_{t+1}-x_{t+1}||^2 \end{align} \]

Then, substituting this bound and using \(\gamma = 1/L\), the \(||g_t||^2\) and \(||y_{t+1}-x_{t+1}||^2\) terms cancel:

\[\begin{align} ||x_{t+1}-x^*||^2&\leq \gamma^2||g_t||^2+2\gamma[f(x^*)-f(x_t)]+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2\\ &\leq \gamma^2||g_t||^2+(1-\gamma\mu) ||x_t-x^*||^2-||y_{t+1}-x_{t+1}||^2+\gamma\left[-\frac{1}{L}||g_t||^2+L||y_{t+1}-x_{t+1}||^2\right]\\ &= \left(1-\frac{\mu}{L}\right) ||x_t-x^*||^2 \end{align} \]

  • \(\text{(ii) The optimality gap after } T \text{ steps satisfies:}\)

\[\begin{align} f(x_T)-f(x^*)\leq ||\nabla f(x^*)||(1-\frac{\mu}{L})^{T/2}||x_0-x^*||+\frac{L}{2}(1-\frac{\mu}{L})^T||x_0-x^*||^2 \end{align} \]

\(\textbf{Proof:}\)
By smoothness, Cauchy-Schwarz, and then part (i):

\[\begin{align} f(x_T)-f(x^*)&\leq \nabla f(x^*)^T(x_T-x^*)+\frac{L}{2}||x^*-x_T||^2\\ &\leq ||\nabla f(x^*)||\cdot ||x_T-x^*||+\frac{L}{2}||x^*-x_T||^2\\ &\leq ||\nabla f(x^*)||\left(1-\frac{\mu}{L}\right)^{T/2}||x_0-x^*||+\frac{L}{2}\left(1-\frac{\mu}{L}\right)^T||x_0-x^*||^2 \end{align} \]
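To see property (i) in action, here is a small Python sketch (my own illustrative example, not from the notes): projected gradient descent with \(\gamma = 1/L\) on a strongly convex quadratic over a Euclidean ball, printing the contraction ratio \(||x_{t+1}-x^*||^2/||x_t-x^*||^2\), which should stay at or below \(1-\mu/L\).

```python
import numpy as np

# Illustrative setup: f(x) = 0.5 x^T A x - b^T x over X = {x : ||x||_2 <= R}.
# The eigenvalues of A give the strong convexity (mu) and smoothness (L) parameters.
np.random.seed(0)
d = 5
eigs = np.linspace(1.0, 10.0, d)               # mu = 1, L = 10
Q, _ = np.linalg.qr(np.random.randn(d, d))
A = Q @ np.diag(eigs) @ Q.T
b = np.random.randn(d)
mu, L, R = eigs[0], eigs[-1], 1.0

def project_ball(y, R):
    """Euclidean projection onto {x : ||x||_2 <= R}."""
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def pgd(x0, gamma, steps):
    x = x0
    for _ in range(steps):
        x = project_ball(x - gamma * (A @ x - b), R)   # gradient of f is Ax - b
    return x

# Approximate the constrained minimizer x* by running many iterations.
x_star = pgd(np.zeros(d), 1.0 / L, 10_000)

x = project_ball(np.ones(d), R)                 # arbitrary feasible start
for t in range(10):
    x_next = project_ball(x - (1.0 / L) * (A @ x - b), R)
    ratio = np.linalg.norm(x_next - x_star)**2 / np.linalg.norm(x - x_star)**2
    print(f"t={t}: ratio = {ratio:.4f}   (bound 1 - mu/L = {1 - mu/L:.4f})")
    x = x_next
```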

2. Projecting onto \(l_1\)-balls

\[X = B_1(R) = \{ x\in\mathbb{R}^d:||x||_1 = \sum_{i=1}^d|x_i|\leq R \} \]

\(\textbf{Fact 3.6}\) We may assume without loss of generality that (i) \(R=1\), (ii) \(v_i\geq 0\) for all \(i\), and (iii) \(\sum_{i=1}^dv_i>1\).

\(\textbf{Fact 3.7}\) Under the assumptions of Fact \(3.6\), \(x = \prod_X(v)\) satisfies \(x_i\geq 0\) for all \(i\) and \(\sum_{i=1}^dx_i=1\).

\(\textbf{Corollary 3.8}\) Under the assumptions of Fact \(3.6\),

\[\prod_X(v) = \arg\min_{x\in \Delta_d}||x-v||^2 \]

where

\[\Delta_d = \{x\in\mathbb{R}^d:\sum_{i=1}^dx_i=1,x_i\geq 0 \} \]

is the standard simplex.

\(\textbf{Fact 3.9}\) We may assume that \(v_1\geq v_2\geq ...\geq v_d\)

\(\large\textbf{Lemma 3.10}\) Let \(x^* = \arg\min_{x\in\Delta_d}||x-v||^2\). Under the assumption of Fact \(3.9\), there exists a unique \(p\in \{1,\dots,d \}\) such that:

\[\begin{align} x_i^*&>0,\quad i\leq p,\\ x_i^*&=0,\quad i>p \end{align} \]

\(\textbf{Proof:}\)
Recall the first-order optimality condition from the previous lemma: for a minimizer \(x^*\) of \(f\) over a convex set \(X\), for all \(x\in X\),

\[\nabla f(x^*)^T(x-x^*)\geq 0 \]

Therefore:

\[\nabla d_v(x^*)^T(x-x^*) = 2(x^*-v)^T(x-x^*)\geq 0,x\in \Delta_d \]

where

\[d_v(z) = ||z-v||^2 \]

Because \(\sum_i x_i^*=1\), there is at least one positive entry in \(x^*\). It remains to show that we cannot have \(x_i^*=0\) and \(x_{i+1}^*>0\). Indeed, if this were the case, we could decrease \(x_{i+1}^*\) by some positive \(\epsilon\) and simultaneously increase \(x_i^*\) to \(\epsilon\), obtaining a vector \(x\in\Delta_d\) such that:

\[(x^*-v)^T(x-x^*) = (0-v_i)\epsilon -(x_{i+1}^*-v_{i+1})\epsilon = \epsilon(v_{i+1}-v_i-x_{i+1}^*)<0 \]

contradicting the optimality condition above.

\(\large\textbf{Lemma 3.11}\) Under the assumption of Fact \(3.9\), and with \(p\) as in Lemma \(3.10\):

\[x_i^* = v_i-\Theta_p,i\leq p \]

where

\[\Theta_p = \frac{1}{p}\left(\sum_{i=1}^pv_i-1\right) \]

(The value of \(\Theta_p\) follows from the constraint \(\sum_{i=1}^px_i^* = \sum_{i=1}^p(v_i-\Theta_p)=1\).)

\(\large\textbf{Lemma 3.12}\) Under the assumption of Fact \(3.9\), with \(x^*(p)\) as

\[x^*(p) = (v_1-\Theta_p,\dots,v_p-\Theta_p,0,\dots,0),\quad p\in\{1,\dots,d \} \]

and with

\[p^* = \max\{p\in\{1,...,d\}:v_p-\frac{1}{p}(\sum_{i=1}^pv_i-1)>0 \} \]

it holds that:

\[\arg\min_{x\in\Delta_d}||x-v||^2 = x^*(p^*) \]
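Lemma 3.12, together with Facts 3.6 to 3.9, yields an \(O(d\log d)\) algorithm for projecting onto \(B_1(R)\): sort, find \(p^*\), subtract \(\Theta_{p^*}\). The Python sketch below is one possible implementation (the function names and the final sign/scale reduction are mine, following Fact 3.6):

```python
import numpy as np

def project_simplex(v):
    """Projection onto the standard simplex {x : sum(x) = 1, x >= 0},
    following Lemma 3.12 (assumes sum(v) > 1 when called from below)."""
    u = np.sort(v)[::-1]                      # Fact 3.9: sort in decreasing order
    cumsum = np.cumsum(u)
    p_range = np.arange(1, len(v) + 1)
    # p* = max{p : u_p - (1/p)(sum_{i<=p} u_i - 1) > 0}
    cond = u - (cumsum - 1) / p_range > 0
    p_star = p_range[cond][-1]
    theta = (cumsum[p_star - 1] - 1) / p_star
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, R=1.0):
    """Projection of v onto the l1-ball {x : ||x||_1 <= R}."""
    if np.sum(np.abs(v)) <= R:                # already feasible
        return v.copy()
    # Fact 3.6: reduce to the simplex case, then restore scale and signs.
    x = project_simplex(np.abs(v) / R)
    return R * np.sign(v) * x

v = np.array([0.8, -2.5, 1.3, 0.1])
x = project_l1_ball(v, R=1.0)
print(x, np.sum(np.abs(x)))                   # l1-norm of the result is 1
```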

3. Proximal Gradient Descent

An important class of objective functions has the composite form:

\[f(x) = g(x)+h(x) \]

where \(g\) is a "nice" (differentiable) function. The classical gradient step for unconstrained minimization of \(g\) can be equivalently written as

\[\begin{align} x_{t+1}&=\arg\min_{y}g(x_t)+\nabla g(x_t)^T(y-x_t)+\frac{1}{2\gamma}||y-x_t||^2\\ &=\arg\min_{y} \frac{1}{2\gamma}||y - (x_t-\gamma\nabla g(x_t))||^2 \end{align} \]
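The equivalence of the two formulations can be checked by expanding the square; terms that do not depend on \(y\) do not affect the \(\arg\min\):

\[\frac{1}{2\gamma}||y-(x_t-\gamma\nabla g(x_t))||^2 = \frac{1}{2\gamma}||y-x_t||^2+\nabla g(x_t)^T(y-x_t)+\frac{\gamma}{2}||\nabla g(x_t)||^2 \]

so the two objectives differ only by the constants \(g(x_t)\) and \(\frac{\gamma}{2}||\nabla g(x_t)||^2\).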

Proximal Gradient Algorithm

\(\text{Proximal Mapping:}\)

\[\text{prox}_{h,\gamma}(z) = \arg\min_y\{\frac{1}{2\gamma}||y-z||^2+h(y) \} \]

An iteration of proximal gradient descent is defined as:

\[x_{t+1}=\text{prox}_{h,\gamma}(x_t-\gamma\nabla g(x_t)) \]
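For concreteness, here is a minimal Python sketch of proximal gradient descent, assuming the common choice \(h(x) = \lambda||x||_1\) (my assumption, not specified in the notes), whose proximal mapping is the soft-thresholding operator.

```python
import numpy as np

def prox_l1(z, gamma, lam):
    """prox_{h,gamma}(z) for h(x) = lam * ||x||_1: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def proximal_gradient(grad_g, prox, x0, gamma, steps):
    """Iterates x_{t+1} = prox_{h,gamma}(x_t - gamma * grad g(x_t))."""
    x = x0
    for _ in range(steps):
        x = prox(x - gamma * grad_g(x), gamma)
    return x

# Example: Lasso-type objective g(x) = 0.5 * ||Ax - b||^2, h(x) = lam * ||x||_1.
np.random.seed(1)
A = np.random.randn(20, 5)
b = np.random.randn(20)
lam = 0.5
L = np.linalg.norm(A, 2) ** 2                 # smoothness parameter of g
x = proximal_gradient(
    grad_g=lambda x: A.T @ (A @ x - b),
    prox=lambda z, gamma: prox_l1(z, gamma, lam),
    x0=np.zeros(5),
    gamma=1.0 / L,
    steps=500,
)
print(x)                                      # a (typically sparse) solution
```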

Source: https://www.cnblogs.com/xinyu04/p/16289879.html
