Deep Learning Week6 Notes

1. Benefits of depth

\(\text{Consider ReLU MLPs with a single input/output. There exists a network }f\text{ with }D^*\text{ layers and }2D^*\text{ internal units such that, for any network }g\text{ with }D\text{ layers of sizes }\{W^{(1)},\dots,W^{(D)}\}\text{, since its number of linear pieces satisfies }k(g)\leq 2^D \prod_{d=1}^D W^{(d)}\text{:}\)

\[\begin{align} ||f-g||_1\geq 1-\frac{2^D}{2^{D^*}}\prod_{d=1}^DW^{(d)} \end{align} \]

\(\text{In particular, with }g\text{ a single hidden layer network:}\)

\[||f-g||_1\geq 1-2\frac{W^{(1)}}{2^{D^*}} \]

\(\textbf{To approximate }f\textbf{ properly, the width }W^{(1)}\textbf{ of }g\textbf{'s hidden layer has to increase exponentially with }f\textbf{'s depth }D^*.\)
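
\(\text{As a worked instance of the single-hidden-layer bound (the depth }D^*=20\text{ below is only an illustrative value): for the approximation error to even drop below }\frac{1}{2}\text{, the lower bound must satisfy}\)

\[\begin{align} 1-2\frac{W^{(1)}}{2^{D^*}}\leq\frac{1}{2} \iff W^{(1)}\geq 2^{D^*-2}, \end{align} \]

\(\text{so with }D^*=20\text{ the hidden layer already needs at least }2^{18}\approx 2.6\times 10^5\text{ units.}\)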

2. Rectifiers

\(\text{The derivative of }\tanh\text{ has an exponential tail on both sides and collapses to 0 very quickly, while ReLU keeps the gradient of positive activations unchanged, which often correspond to half of them.}\)

Leaky-ReLU

\[\begin{align} \max(ax,x) \end{align} \]

where \(0\leq a< 1\)
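
A minimal PyTorch illustration (the slope 0.1 below is just an example value, not one prescribed by the notes):

>>> leaky = nn.LeakyReLU(negative_slope = 0.1)    # computes max(0.1 * x, x)
>>> leaky(torch.tensor([-2., -0.5, 0., 1., 3.]))
tensor([-0.2000, -0.0500,  0.0000,  1.0000,  3.0000])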

3. Dropout

Suppose each unit is dropped with probability \(p\); then in expectation:

\[\begin{align} \mathbb{E}(X) = (1-p)X+p\cdot 0 \end{align} \]

Therefore, to keep the expectation unchanged, it suffices to multiply the activations by \(\frac{1}{1-p}\) at training time and to leave the network unchanged at test time. This scheme is called \(\text{Inverted Dropout.}\)
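
A minimal hand-written sketch of inverted dropout (the function below is illustrative, not part of the course code):

import torch

def inverted_dropout(x, p = 0.75, train = True):
    # Training: zero each component with probability p and rescale the
    # survivors by 1/(1-p), so that the expectation of the output equals x.
    if train:
        mask = (torch.rand_like(x) >= p).float()
        return x * mask / (1 - p)
    # Inference: the input is passed through unchanged.
    return x

PyTorch's nn.Dropout applies this same scaling during training, as the session below shows: with \(p=0.75\), the kept activations are multiplied by \(\frac{1}{1-0.75}=4\).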

>>> x = torch.full((3, 5), 1.0).requires_grad_()
>>> x
tensor([[ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.]])
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y
tensor([[ 0., 0., 4., 0., 4.],
        [ 0., 4., 4., 4., 0.],
        [ 0., 0., 4., 0., 0.]])
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad
tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284],
        [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000],
        [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]])

\(\text{Simply add dropout layers:}\)

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(),
                        nn.Dropout(),
                        nn.Linear(100, 50), nn.ReLU(),
                        nn.Dropout(),
                        nn.Linear(50, 2));
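
A dropout layer behaves differently at training and test time, so the network's mode has to be set explicitly (a minimal usage sketch):

model.train()   # dropout active: units are zeroed and the survivors rescaled
# ... optimization loop ...
model.eval()    # dropout disabled: the full network is used at test time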

4. Batch Normalization

\(\text{Forcing the activation statistics during the forward pass by re-normalizing them}\)
\(\\\)

Motivation:

\(\large\textbf{If the statistics of the activations are not controlled during training, a layer will have to adapt to the changes of the activations computed by the previous layers in addition to making changes to its own output to reduce the loss.}\)

\(\\\)
\(\text{During training, batch normalization shifts and rescales according to the mean and variance estimated on the batch.}\)

\(x_b\in \mathbb{R}^D,\ b=1,\dots,B\text{ are the samples in the batch; the empirical mean and variance are:}\)

\[\begin{align} \hat{m}&=\frac{1}{B}\sum_{b=1}^Bx_b\\ \hat{v}&=\frac{1}{B}\sum_{b=1}^B(x_b-\hat{m})^2 \end{align} \]

\(\text{Then do the normalization:}\)

\[\begin{align} z_b&=\frac{x_b-\hat{m}}{\sqrt{\hat{v}+\epsilon}}\\ y_b&=\gamma\odot z_b+\beta \end{align} \]

\(\text{where }z_b,y_b,\gamma,\beta \in \mathbb{R}^D\)
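
A minimal sketch of these two steps written by hand (the function name and the \(\epsilon\) value below are illustrative):

import torch

def batch_norm_train(x, gamma, beta, eps = 1e-5):
    # x: a (B, D) batch; gamma, beta: learnable (D,) parameters
    m_hat = x.mean(dim = 0)                     # empirical mean over the batch
    v_hat = x.var(dim = 0, unbiased = False)    # empirical (biased) variance
    z = (x - m_hat) / torch.sqrt(v_hat + eps)   # normalize
    return gamma * z + beta                     # shift and rescale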

\(\\\)
\(\large\textbf{During inference: }\text{batch normalization shifts and rescales each component of the input }x\text{ independently, according to the statistics estimated during training:}\)

\[\begin{align} y = \gamma\odot \frac{x-\hat{m}}{\sqrt{\hat{v}+\epsilon}}+\beta \end{align} \]

>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
... bn.bias.copy_(torch.tensor([2., 4., 8.]))
... bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.], requires_grad=True)
Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
>>> x = torch.randn(1000, 3)
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9669, 25.0213, 2.4361])
>>> x.std(0)
tensor([1.9063, 5.0764, 9.7474])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000], grad_fn=<MeanBackward2>)
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015], grad_fn=<StdBackward1>)
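
At inference time the module has to be switched to evaluation mode, so that it normalizes with the running estimates collected during training instead of the batch statistics (a sketch continuing the session above; the number of warm-up batches is arbitrary):

# Feed more training-mode batches so the running estimates converge.
for _ in range(100):
    bn(torch.randn(1000, 3) * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.]))

bn.eval()                                  # use bn.running_mean / bn.running_var
y = bn(torch.tensor([[-10., 25., 3.]]))    # a single sample equal to the training mean
# y is approximately the bias [2., 4., 8.], since the normalized input is close to 0.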

5. Layer Normalization

\(\text{Given a single sample }x\in\mathbb{R}^D,\text{ it normalizes the components of }x:\)

\[\begin{align} \mu&=\frac{1}{D}\sum_{d=1}^Dx_d\\ \sigma&=\sqrt{\frac{1}{D}\sum_{d=1}^D(x_d-\mu)^2}\\ \forall d, y_d&=\frac{x_d-\mu}{\sigma} \end{align} \]
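
A minimal sketch comparing this to PyTorch's nn.LayerNorm, which adds a small \(\epsilon\) inside the square root and, by default, a learnable per-component affine initialized to the identity:

>>> x = torch.randn(4, 10)    # 4 samples with D = 10
>>> mu = x.mean(dim = 1, keepdim = True)
>>> sigma2 = x.var(dim = 1, unbiased = False, keepdim = True)
>>> y_manual = (x - mu) / torch.sqrt(sigma2 + 1e-5)
>>> ln = nn.LayerNorm(10)
>>> torch.allclose(ln(x), y_manual, atol = 1e-5)
True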

6. ResNet

import torch
from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        # Two convolutions that preserve the spatial size, each followed by a batch norm
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)
    
    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x
        y = F.relu(y)
        return y

\(\text{Stack several ResBlocks to build a ResNet:}\)

class ResNet(nn.Module):
    def __init__(self, nb_channels, kernel_size, nb_blocks):
        super().__init__()
        self.conv0 = nn.Conv2d(1, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            # A bit of fancy Python
            *(ResBlock(nb_channels, kernel_size) for _ in range(nb_blocks))
        )
        self.avg = nn.AvgPool2d(kernel_size = 28)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = F.relu(self.conv0(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
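
A minimal usage sketch, assuming 1-channel 28×28 inputs (e.g. MNIST), which is what the 1-input-channel conv0 and the AvgPool2d(kernel_size = 28) above imply; the channel and block counts are illustrative:

model = ResNet(nb_channels = 64, kernel_size = 3, nb_blocks = 8)
x = torch.randn(16, 1, 28, 28)    # a batch of 16 single-channel 28x28 images
logits = model(x)                 # shape (16, 10): one score per class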

\(\textbf{Veit et al. (2016) interpret a residual network as an ensemble, which explains in part its stability}\)

\(e.g. \text{ with 3 blocks we have:}\)

\[\begin{align} x_1 &= x_0+f_1(x_0)\\ x_2 &= x_1+f_2(x_1)\\ x_3 &= x_2+f_3(x_2) \end{align} \]

\(\text{Unrolling the recursion, }x_3\text{ is a sum of contributions, each going through a subset of the blocks, i.e. }2^3=8\text{ paths in total:}\)

\[\begin{align} x_3 &= x_2+f_3(x_2)\\ &=x_1+f_2(x_1)+f_3(x_1+f_2(x_1))\\ &=x_0+f_1(x_0)+f_2(x_0+f_1(x_0))+f_3(x_0+f_1(x_0)+f_2(x_0+f_1(x_0))) \end{align} \]

  • \(\textbf{(1) performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed.}\)
  • \(\textbf{(2) only gradients through shallow paths matter during training.}\)

\(\\\)
\(\large\textbf{Summary:}\)

  • \(\text{ReLU to prevent the gradient from vanishing during the backward pass}\)
  • \(\text{Dropout to force a distributed representation}\)
  • \(\text{Batch Normalization to dynamically maintain the statistics of activations}\)
  • \(\text{Identity pass-through to keep a structured gradient and distribute representation}\)
  • \(\text{Smart initialization to put the gradient in a good regime}\)
