Deep Learning Week13 Notes



1. Attention for Memory and Sequence Translation

Attention mechanisms aggregate features with an importance score that:

  • depends on the features themselves, not on their positions in the tensor,
  • relaxes locality constraints.

\(\Large\text{Note:}\)

  • The attention mechanism allows information to move from one part of the tensor to another part far away.
  • For instance, in the case of sequence-to-sequence translation, it can use information from early in the sentence to make a proper grammatical decision later.
  • For images, it can combine information from different parts of the image even if they are far away.

Neural Turing Machine

\(\large\textbf{Illustration: refer }\) Lecture-P6

This module has a hidden internal state that takes the form of a tensor:

\[M_t\in \mathbb{R}^{N\times M} \]

where \(t\) is the time step, \(N\) is the number of entries in the memory and \(M\) is their dimension.

A “controller” is implemented as a standard feed-forward or recurrent model and at every iteration \(t\) it computes activations that modulate the reading / writing operations.

More formally, the memory module implements:

  • Reading, where given attention weights \(w_t\in\mathbb{R}_{+}^N, \sum_nw_t(n)=1\), it gets

\[r_t = \sum_{n=1}^Nw_t(n)M_t(n) \]

  • Writing, where given attention weights \(w_t\), an erase vector \(e_t\in [0,1]^M\), and an add vector \(a_t\in \mathbb{R}^M\), the memory is updated with:

\[\forall n, M_t(n) = M_{t-1}(n)[1-w_t(n)e_t]+w_t(n)a_t \]

The controller has multiple “heads”. At each \(t\), it computes \(w_t, e_t, a_t\) for each writing head, and \(w_t\) for each reading head, from which it gets back a read value \(r_t\).
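
A minimal sketch of the two memory operations in PyTorch (tensor names and sizes are illustrative, not taken from the lecture):

import torch

def ntm_read(M, w):
    # M: (N, D) memory, w: (N,) attention weights summing to 1
    return (w.unsqueeze(1) * M).sum(0)             # r_t, a vector of dimension D

def ntm_write(M, w, e, a):
    # e: (D,) erase vector in [0, 1]^D, a: (D,) add vector
    M = M * (1 - w.unsqueeze(1) * e.unsqueeze(0))  # erase step
    return M + w.unsqueeze(1) * a.unsqueeze(0)     # add step

M = torch.rand(8, 5)                               # N = 8 entries of dimension 5
w = torch.softmax(torch.rand(8), dim = 0)          # attention weights
r = ntm_read(M, w)
M = ntm_write(M, w, torch.rand(5), torch.rand(5))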

Attention for seq2seq

Given an input sequence \(x_1,...,x_T\), the standard approach for sequence-to-sequence translation (Sutskever et al., 2014) uses a recurrent model:

\[h_t = f(x_t,h_{t-1}) \]

and considers that the final hidden state:

\[v = h_T \]

carries enough information to drive an auto-regressive generative model:

\[y_t\sim p(y_t\mid y_1,...,y_{t-1},v) \]

itself implemented with another RNN.
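
A minimal sketch of this standard encoding step with a GRU, where the last hidden state plays the role of \(v\) (dimensions and names are illustrative):

import torch
from torch import nn

T, D, H = 12, 16, 32            # sequence length, input dim, hidden dim
x = torch.rand(T, 1, D)         # input sequence x_1, ..., x_T (batch of 1)
encoder = nn.GRU(D, H)
h, h_T = encoder(x)             # h: all hidden states, h_T: final hidden state
v = h_T.squeeze(0).squeeze(0)   # v = h_T, the single vector driving the decoder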

$\LARGE \star $ The main weakness of such an approach is that all the information has to flow through a single state \(v\), whose capacity has to accommodate any situation. There are no direct “channels” to transport local information from the input sequence to the place where it is useful in the resulting sequence.

Attention mechanisms (Bahdanau et al., 2014) can transport information from parts of the signal to other parts specified dynamically.

Bahdanau et al. (2014) proposed to extend a standard recurrent model with such a mechanism. They first run a bi-directional RNN to get a hidden state:

\[h_{i}=\left(h_{i}^{\rightarrow}, h_{i}^{\leftarrow}\right), \quad i=1, \ldots, T \]

From this, they compute a new process \(s_i, i = 1,...,T\), which looks at weighted averages of the \(h_j\), where the weights are functions of the signal.

Given \(y_1,...,y_{i-1}\) and \(s_1,...,s_{i-1}\), first compute attention weights:

\[\forall j, \alpha_{i, j}=\operatorname{softmax}_{j} a\left(s_{i-1}, h_{j}\right) \]

where \(a\) is a one-hidden-layer \(\tanh\) MLP. Then compute the context vector from \(h\):

\[c_i = \sum_{j=1}^T \alpha_{i,j} h_j \]

The model can now make the prediction:

\[\begin{align} s_i &= f(s_{i-1},y_{i-1},c_i)\\ y_i&\sim g(y_{i-1},s_i,c_i) \end{align} \]

where \(f\) is a GRU.
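
A minimal sketch of one attention step (the exact parametrization of \(a\) and the dimensions are illustrative assumptions, not the paper's):

import torch
from torch import nn

T, H, S = 10, 32, 32                              # input length, dim of h_j, dim of s_{i-1}
h = torch.rand(T, H)                              # bi-directional hidden states h_1, ..., h_T
s_prev = torch.rand(S)                            # previous decoder state s_{i-1}

# a(s_{i-1}, h_j): one-hidden-layer tanh MLP producing one score per position j
a = nn.Sequential(nn.Linear(S + H, 64), nn.Tanh(), nn.Linear(64, 1))

scores = a(torch.cat([s_prev.expand(T, S), h], dim = 1)).squeeze(1)  # (T,)
alpha = scores.softmax(dim = 0)                   # attention weights alpha_{i,.}
c = (alpha.unsqueeze(1) * h).sum(0)               # context vector c_i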

\(\Large\textbf{Illustration: refer }\) Lecture-P20

2. Attention Mechanisms

  • The simplest form of attention is content-based attention. Given an “attention function”:

\[a:\mathbb{R}^{D'}\times\mathbb{R}^D\rightarrow \mathbb{R} \]

and model parameters:

\[\theta\in \mathbb{R}^{T\times D} \]

this operation takes a “value” tensor as input:

\[V\in \mathbb{R}^{T'\times D'} \]

and computes the output:

\[Y\in\mathbb{R}^{T\times D'} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \frac{\exp \left(a\left(V_{i} ; \theta_{j}\right)\right)}{\sum_{k=1}^{T^{\prime}} \exp \left(a\left(V_{k} ; \theta_{j}\right)\right)} V_{i} \\ &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(V_{i} ; \theta_{j}\right)\right) V_{i} \end{aligned} \]

  • This differs from context attention, which, given two inputs: a “context” tensor:

\[C\in \mathbb{R}^{T\times D} \]

and a "value" tensor:

\[V\in \mathbb{R}^{T'\times D'} \]

computes a tensor

\[Y\in \mathbb{R}^{T\times D'} \]

with

\[\begin{aligned} \forall j=1, \ldots, T, \quad Y_{j} &=\sum_{i=1}^{T^{\prime}} \operatorname{softmax}_{i}\left(a\left(C_j,V_{i} ; \theta\right)\right) V_{i} \end{aligned} \]

\(\large\text{Illustration of the difference: }\)Lecture-P4
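
A minimal sketch contrasting the two, using a dot product as the attention function \(a\) and taking \(D'=D\) so the product is defined (both are illustrative choices):

import torch

Tp, T, D = 7, 4, 16                                # T' values, T outputs, feature dimension
V = torch.rand(Tp, D)                              # value tensor

# Content-based: the score a(V_i; theta_j) uses the trained parameters theta as queries
theta = torch.rand(T, D)                           # model parameters theta_1, ..., theta_T
Y_content = (theta @ V.t()).softmax(dim = 1) @ V   # (T, D)

# Context attention: the score a(C_j, V_i; theta) uses a context tensor C given as input
C = torch.rand(T, D)
Y_context = (C @ V.t()).softmax(dim = 1) @ V       # (T, D)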

Using the terminology of Graves et al. (2014), attention is an averaging of values associated to keys matching a query. Hence the keys used for computing attention and the values to average are different quantities.

Given a query sequence \(Q\in\mathbb{R}^{T\times D}\), a key sequence \(K\in \mathbb{R}^{T'\times D}\), and a value sequence \(V\in\mathbb{R}^{T'\times D'}\), compute a matrix \(A\in \mathbb{R}^{T\times T'}\) by matching \(Q\) to \(K\), and weight \(V\) with it to get the result sequence \(Y\in\mathbb{R}^{T\times D'}\):

\[\begin{align} \forall i, A_i &= \text{softmax}(\frac{KQ_i}{\sqrt{D}})\\ Y_i &= V^TA_i \end{align} \]

or

\[\begin{align} A &= \text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})\in \mathbb{R}^{T\times T'}\\ Y&= AV\in\mathbb{R}^{T\times D'} \end{align} \]

The queries and keys have the same dimension \(D\), and there are as many keys \(T'\) as there are values. The result \(Y\) has as many rows \(T\) as there are queries, and its rows have the same dimension \(D'\) as the values.

\(\large\text{Illustration: refer }\) Lecture-P9.
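
The matrix form above translates directly to code (dimensions are illustrative):

import math
import torch

T, Tp, D, Dp = 5, 8, 16, 24
Q = torch.rand(T, D)                               # queries
K = torch.rand(Tp, D)                              # keys, one per value
V = torch.rand(Tp, Dp)                             # values

A = (Q @ K.t() / math.sqrt(D)).softmax(dim = 1)    # (T, T'), each row sums to 1
Y = A @ V                                          # (T, D')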

A standard attention layer takes as input two sequences \(X\) and \(X'\), and computes the tensors \(K,V,Q\) as the linear functions:

\[\begin{align} K&= W^KX\\ V&=W^VX\\ Q&=W^QX'\\ Y&=\text{softmax}_{\text{row}}(\frac{QK^T}{\sqrt{D}})V \end{align} \]

When \(X = X'\), this is self-attention; otherwise it is cross-attention.

Multi-head attention combines several such operations in parallel, and \(Y\) is the concatenation of the results along the feature dimension.

\(\Large\textbf{Note:}\)

  • The terminology of attention mechanism comes from the paradigm of key-value dictionaries for data storage in which objects (the values) are stored using a key.
  • Querying the database consists of matching a query with the keys of the database to retrieve the values associated to them.
  • This is why the matrices \(Q\) and \(K\) have the same number of columns, corresponding to the dimension \(D\) of individual keys or queries, because we compute matches between them. The matrices \(K\) and \(V\) have the same number of rows \(T'\) because each value is “indexed” by one key.
  • Each row \(Y_j\) of the output corresponds to a weighted average of the values modulated by how much the query matched the associated key.
  • \(\LARGE\star\) This is exactly what an attention layer would do: equip the model with the ability to combine information from parts of the signal that it actively identifies as relevant.

\(\text{batch matrix product}\): torch.matmul()

>>> a = torch.rand(11, 9, 2, 3)
>>> b = torch.rand(11, 9, 3, 4)
>>> m = a.matmul(b)
>>> m.size()
torch.Size([11, 9, 2, 4])
>>>
>>> m[7, 1]
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>> a[7, 1].mm(b[7, 1])
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
        [0.4966, 0.5515, 0.4631, 0.6616]])
>>>
>>> m[3, 0]
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])
>>> a[3, 0].mm(b[3, 0])
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
        [0.6259, 0.5570, 1.1012, 1.2319]])

\(\text{Attention layer Code:}\)

class AttentionLayer(nn.Module):
    def __init__(self, in_channels, out_channels, key_channels):
        super().__init__()
        # 1x1 convolutions compute Q, K, V as linear functions of the input
        self.conv_Q = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_K = nn.Conv1d(in_channels, key_channels, kernel_size = 1, bias = False)
        self.conv_V = nn.Conv1d(in_channels, out_channels, kernel_size = 1, bias = False)

    def forward(self, x):
        # x is of shape (N, in_channels, T)
        Q = self.conv_Q(x)                              # (N, key_channels, T)
        K = self.conv_K(x)                              # (N, key_channels, T)
        V = self.conv_V(x)                              # (N, out_channels, T)
        A = Q.transpose(1, 2).matmul(K).softmax(2)      # (N, T, T), each row sums to 1
        y = A.matmul(V.transpose(1, 2)).transpose(1, 2) # (N, out_channels, T)
        return y

The computation of the attention matrix \(A\) and the layer’s output \(Y\) could also be expressed somewhat more clearly with Einstein summations:

A = torch.einsum('nct,ncs->nts', Q, K).softmax(2)
y = torch.einsum('nts,ncs->nct', A, V)

Positional Encoding

>>> len = 20
>>> c = math.ceil(math.log(len) / math.log(2.0))
>>> o = 2**torch.arange(c).unsqueeze(1)
>>> pe = (torch.arange(len).unsqueeze(0).div(o, rounding_mode = 'floor')) % 2
>>> pe
tensor([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
        [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

3. Transformer Networks

\(\Large\text{Illustration: refer }\) Lecture-P2

\[\begin{aligned} \operatorname{Attention}(Q, K, V) &=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V \\ \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(H_{1}, \ldots, H_{h}\right) W^{O} \\ H_{i} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right), i=1, \ldots, h \end{aligned} \]

where

\[W_{i}^{Q} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{K} \in \mathbb{R}^{d_{\text {model }} \times d_{k}}, \quad W_{i}^{V} \in \mathbb{R}^{d_{\text {model }} \times d_{v}}, \quad W^{O} \in \mathbb{R}^{h d_{v} \times d_{\text {model }}} \]
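
A minimal sketch of these formulas for a single sequence, with random weight matrices standing in for trained parameters (all sizes are illustrative):

import math
import torch

T, d_model, h = 10, 64, 4
d_k = d_v = d_model // h

X = torch.rand(T, d_model)                          # here Q = K = V = X (self-attention)
W_Q = torch.rand(h, d_model, d_k)
W_K = torch.rand(h, d_model, d_k)
W_V = torch.rand(h, d_model, d_v)
W_O = torch.rand(h * d_v, d_model)

heads = []
for i in range(h):
    Qi, Ki, Vi = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    Ai = (Qi @ Ki.t() / math.sqrt(d_k)).softmax(dim = 1)
    heads.append(Ai @ Vi)                           # H_i, of shape (T, d_v)

Y = torch.cat(heads, dim = 1) @ W_O                 # MultiHead(Q, K, V), of shape (T, d_model)

PyTorch also provides this operation as torch.nn.MultiheadAttention.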

\(\textbf{Positional information:}\)

\[\begin{gathered} P E_{t, 2 i}=\sin \left(\frac{t}{10,000^{\frac{2 i}{d_{\text {model }}}}}\right) \\ P E_{t, 2 i+1}=\cos \left(\frac{t}{10,000^{\frac{2 i}{d_{\text {model }}}}}\right) . \end{gathered} \]
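
A minimal sketch generating this encoding (sizes are illustrative):

import math
import torch

T, d_model = 50, 64
t = torch.arange(T).unsqueeze(1)                             # positions t, as a column
two_i = torch.arange(0, d_model, 2)                          # even feature indices 2i
inv_freq = torch.exp(-two_i / d_model * math.log(10000.0))   # 1 / 10,000^{2i / d_model}

pe = torch.zeros(T, d_model)
pe[:, 0::2] = torch.sin(t * inv_freq)                        # PE_{t, 2i}
pe[:, 1::2] = torch.cos(t * inv_freq)                        # PE_{t, 2i+1}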

\(\Large\text{Overall Illustration: refer }\) Lecture-P5

BERT (Bidirectional Encoder Representation from Transformers, Devlin et al., 2018) is a transformer pre-trained with:

  • Masked Language Model (MLM), which consists in predicting words (\(15\)% of them) that have been replaced with a “[MASK]” token (see the example below).
  • Next Sentence Prediction (NSP), which consists in predicting whether a given sentence follows the current one.

\(\Large\text{Illustration: refer }\) Lecture-P14
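
The MLM objective can be probed directly with HuggingFace's fill-mask pipeline (the model name and the prompt are just illustrative):

from transformers import pipeline

fill = pipeline('fill-mask', model = 'bert-base-uncased')
for prediction in fill('Attention is all you [MASK].'):
    print(prediction['token_str'], prediction['score'])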

\(\text{GPT: a transformer trained for auto-regressive text generation}\) Lecture-P18

We can use HuggingFace’s pre-trained models:

import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100): # generate no more than 100 tokens
    outputs = model(torch.tensor([tokens])).logits
    next_token = torch.argmax(outputs[0, -1]).item() # greedy choice of the next token
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.': break

print(tokenizer.decode(tokens))

Vision Transformers

\(\Large\text{Illustration: refer }\) Lecture-P31
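
A minimal sketch of the core idea: the image is cut into fixed-size patches, each patch is linearly projected to a token, positional information is added, and the resulting sequence is fed to a standard transformer encoder (all sizes and the use of nn.TransformerEncoder are illustrative):

import torch
from torch import nn

B, C, H, W, P, d_model = 2, 3, 224, 224, 16, 64
x = torch.rand(B, C, H, W)

# A Conv2d with kernel_size = stride = P linearly projects each P x P patch to a token
to_tokens = nn.Conv2d(C, d_model, kernel_size = P, stride = P)
tokens = to_tokens(x).flatten(2).transpose(1, 2)        # (B, (H/P) * (W/P), d_model)

pos = torch.rand(1, tokens.size(1), d_model)            # stand-in for learned positional embeddings
layer = nn.TransformerEncoderLayer(d_model, nhead = 4, batch_first = True)
encoder = nn.TransformerEncoder(layer, num_layers = 2)
y = encoder(tokens + pos)                               # (B, 196, d_model)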
