
TensorFlow Study Notes: Natural Language Processing

2021-02-28 19:32:07


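Introduction

TensorFlow version: 1.15, installed with pip3 install tensorflow==1.15.0.
These are study notes for the book 《TensorFlow实战Google深度学习框架(第2版)》 (TensorFlow in Practice: Google's Deep Learning Framework, 2nd edition). All code runs correctly under TensorFlow 1.15.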

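Background on language models

Introduction to language models

The task of a language model is to predict the probability that a sentence occurs in a language. A good language model should assign a relatively high probability to sentences that are common in the language, and a probability close to zero to ungrammatical sentences.

Evaluating language models

The standard metric for the quality of a language model is perplexity. The lower the perplexity on a test set, the better the model fits the data. Perplexity is computed as follows: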

$$perplexity(S) = p(w_1,w_2,w_3,\cdots,w_m)^{-1/m} = \sqrt[m]{\frac{1}{p(w_1,w_2,w_3,\cdots,w_m)}} = \sqrt[m]{\prod_{i=1}^m \frac{1}{p(w_i|w_1,\cdots,w_{i-1})}}$$

When training a language model, the logarithmic form of perplexity is usually used:
$$\log (perplexity(S)) = -\frac{1}{m} \sum_{i=1}^m \log p(w_i|w_1,\cdots,w_{i-1})$$

Mathematically, log perplexity can be viewed as the cross-entropy between the true distribution and the predicted distribution.
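As a small numerical check of the two forms of the definition, the sketch below computes perplexity both directly and through the average negative log-probability. The per-word probabilities are made-up values for a hypothetical 4-word sentence, not the output of any real model.

```python
import math


def perplexity(word_probs):
    """Perplexity from the conditional probabilities p(w_i | w_1..w_{i-1})."""
    m = len(word_probs)
    # log perplexity = average negative log-probability per word
    log_ppl = -sum(math.log(p) for p in word_probs) / m
    return math.exp(log_ppl)


# Hypothetical probabilities a model assigns to the 4 words of one sentence.
probs = [0.2, 0.5, 0.1, 0.25]
ppl = perplexity(probs)

# The same value from the direct definition p(w_1..w_m)^(-1/m).
sentence_prob = 1.0
for p in probs:
    sentence_prob *= p
direct = sentence_prob ** (-1.0 / len(probs))

# A model that spreads probability uniformly over V words has perplexity V.
uniform_ppl = perplexity([1.0 / 10000] * 5)
```

Working in log space, as the first function does, avoids the numerical underflow that multiplying many small probabilities would cause on long texts.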

TensorFlow provides two convenient functions for computing cross-entropy: tf.nn.softmax_cross_entropy_with_logits and tf.nn.sparse_softmax_cross_entropy_with_logits. The example below shows how they differ:

import tensorflow as tf

# Assume a vocabulary of size 3 and a corpus made of the two words "2 0"
word_labels = tf.constant([2,0])

# Unnormalized prediction scores (logits) for each of the two positions
predict_logits = tf.constant([[2.0, -1.0, 3.0],[1.0,0.0,-0.5]])
# Compute the cross-entropy with sparse_softmax_cross_entropy_with_logits
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=word_labels, logits=predict_logits)

sess = tf.Session()
print(sess.run(loss)) # array([0.32656264, 0.4643688 ], dtype=float32)

# softmax_cross_entropy_with_logits expects the target as a probability distribution
word_prob_distribution = tf.constant([[0.0,0.0,1.0],[1.0,0.0,0.0]])

loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=word_prob_distribution,logits=predict_logits
)
print(sess.run(loss)) # array([0.32656264, 0.4643688 ], dtype=float32)
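To see why both functions return the same numbers here, the following plain-Python sketch recomputes the two losses by hand from the same logits. The helper names (softmax, sparse_cross_entropy, dense_cross_entropy) are illustrative only and are not TensorFlow APIs.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(label, logits):
    # The target is a class index: the loss is -log of that class's probability.
    return -math.log(softmax(logits)[label])


def dense_cross_entropy(target_dist, logits):
    # The target is a full distribution: the loss is -sum_k t_k * log p_k.
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)


predict_logits = [[2.0, -1.0, 3.0], [1.0, 0.0, -0.5]]
word_labels = [2, 0]
word_prob_distribution = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

sparse_losses = [sparse_cross_entropy(y, z)
                 for y, z in zip(word_labels, predict_logits)]
dense_losses = [dense_cross_entropy(t, z)
                for t, z in zip(word_prob_distribution, predict_logits)]
```

With one-hot targets the two computations coincide; the dense form only differs when the target distribution puts weight on more than one class, for example with label smoothing.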

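Neural language models

Preprocessing the PTB dataset

The dataset is available from the download link.

The dataset contains 9998 distinct words. Together with the special symbol <unk> for rare words and the end-of-sentence marker <eos>, the vocabulary has 10000 entries.
To turn the text into a word sequence that the model can read, these 10000 distinct entries must be mapped to integer IDs between 0 and 9999. The program below first assigns an ID to each word in order of frequency and then saves the vocabulary to a separate vocab file.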

import codecs
import collections
from operator import itemgetter

RAW_DATA = './datasets/ptb/ptb.train.txt' # training data file
VOCAB_OUTPUT= 'ptb.vocab' # output vocabulary file

counter = collections.Counter()
with codecs.open(RAW_DATA, 'r', 'utf-8') as f:
    for line in f:
        for word in line.strip().split():
            counter[word] += 1

# Sort the words by frequency, most frequent first
sorted_word_to_cnt = sorted(counter.items(), key = itemgetter(1), reverse=True) # itemgetter(1) picks the count
sorted_words = [x[0] for x in sorted_word_to_cnt]

# An end-of-sentence marker <eos> will later be added at every line break, so add it to the vocabulary in advance
sorted_words =['<eos>'] + sorted_words

with codecs.open(VOCAB_OUTPUT, 'w', 'utf-8') as file_output:
    for word in sorted_words:
        file_output.write(word + '\n')

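Once the vocabulary is fixed, the training file, test file and so on are converted to word IDs using the vocabulary file. The ID of each word is its zero-based line number in the vocabulary file.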

import codecs
import sys

RAW_DATA = './datasets/ptb/ptb.train.txt' # training data file
VOCAB_OUTPUT= 'ptb.vocab' # vocabulary file
OUTPUT_DATA = 'ptb.train' # output file with words replaced by their IDs

# Read the vocabulary
with codecs.open(VOCAB_OUTPUT, 'r', 'utf-8') as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]

# Build the word-to-ID mapping
word_to_id = {k: v for (k,v) in zip(vocab, range(len(vocab)))}

# Words that are not in the vocabulary (low-frequency words) are replaced by '<unk>'
def get_id(word):
    unk_id = word_to_id.get('<unk>')
    return word_to_id.get(word,unk_id)

fin = codecs.open(RAW_DATA, 'r', 'utf-8')
fout = codecs.open(OUTPUT_DATA,'w','utf-8')

for line in fin:
    words = line.strip().split() + ['<eos>'] # read the sentence and append the <eos> marker
    out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
    fout.write(out_line)

fin.close()
fout.close()

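Batching the PTB data

Because sentences in text data differ in length, batching text requires some special handling. The most common approach is padding, which brings all sentences in the same batch to the same length.

In the PTB dataset, however, the sentences are not randomly sampled text; neighbouring sentences are related by context.
To exploit this context, a language model has to carry information from earlier sentences over to later ones. For datasets such as PTB, where the context is connected, a different batching method is therefore usually used.

If there were no limit on model size, the ideal design would concatenate the whole document and train on it as a single sentence.

In practice this is not feasible. The solution is to cut the long sequence into sub-sequences of fixed length. After the RNN has processed one sub-sequence, its final hidden state is copied to the next sub-sequence as the initial value. In the forward pass this is equivalent to reading the whole document in order in one go, while in backpropagation the gradient only flows within each sub-sequence.

To use parallel hardware, we want each computation to process several sentences at once, while still keeping the context continuous from one batch to the next.
The solution is to first split the whole document into several contiguous segments and then let each position in the batch be responsible for one segment.

For example, with a batch size of 4, the whole corpus is first divided evenly into 4 sub-sequences (each of which may contain several documents plus part of a document), and each position in the batch handles one sub-sequence, so that all the data inside each sub-sequence can still be processed in order.

The code below reads the data from a text file and arranges it into batches as described above. Because the PTB dataset is fairly small, the whole dataset can be read into memory at once.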

import numpy as np
import tensorflow as tf

TRAIN_DATA = 'ptb.train'
TRAIN_BATCH_SIZE = 20
TRAIN_NUM_STEP = 35 # truncation length; can be viewed as the assumed number of words per sentence

# Read data from a file and return a list of word IDs
def read_data(file_path):
    with open(file_path, 'r') as fin:
        # Read the whole document into one long string
        id_string = ' '.join([line.strip() for line in fin.readlines()])
    id_list = [int(w) for w in id_string.split()] # convert the word IDs to integers
    return id_list

def make_batches(id_list, batch_size, num_step):
    # Number of batches; each batch holds batch_size * num_step words
    num_batches = (len(id_list) -1) // (batch_size * num_step)

    # Arrange the data as a 2-D array of shape [batch_size, num_batches * num_step]
    data = np.array(id_list[:num_batches * batch_size * num_step])
    data = np.reshape(data, [batch_size, num_batches * num_step])
    # Split along the second dimension into num_batches batches, stored in a list
    data_batches = np.split(data, num_batches, axis=1)

    # Repeat the above with every position shifted one to the right. This gives
    # the next word that the RNN has to predict at each step.
    label = np.array(id_list[1:num_batches * batch_size * num_step + 1])
    label = np.reshape(label, [batch_size, num_batches * num_step])
    label_batches = np.split(label, num_batches, axis=1)
    
    return list(zip(data_batches, label_batches))


train_batches = make_batches(read_data(TRAIN_DATA), TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)
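To make the resulting layout concrete, here is a plain-Python sketch of the same reshape-and-split logic applied to a toy sequence of 13 IDs. The function name make_batches_py and the toy data are illustrative assumptions; the sketch is meant to mirror what the NumPy version above does, not to replace it.

```python
def make_batches_py(id_list, batch_size, num_step):
    """List-based equivalent of the reshape + split batching scheme."""
    num_batches = (len(id_list) - 1) // (batch_size * num_step)
    row_len = num_batches * num_step

    def to_rows(seq):
        # Row r holds one contiguous segment of the corpus.
        return [seq[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

    data_rows = to_rows(id_list[:batch_size * row_len])
    # Labels are the same sequence shifted one position to the right.
    label_rows = to_rows(id_list[1:batch_size * row_len + 1])

    batches = []
    for b in range(num_batches):
        lo, hi = b * num_step, (b + 1) * num_step
        x = [row[lo:hi] for row in data_rows]
        y = [row[lo:hi] for row in label_rows]
        batches.append((x, y))
    return batches


toy_ids = list(range(13))  # stand-in for a corpus of word IDs 0..12
toy_batches = make_batches_py(toy_ids, batch_size=2, num_step=3)
# Batch 0: x = [[0, 1, 2], [6, 7, 8]]
# Batch 1: x = [[3, 4, 5], [9, 10, 11]]
```

Row 0 of batch 1 continues exactly where row 0 of batch 0 stopped, which is what allows the final hidden state of one batch to serve as the initial state of the next.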

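A neural language model based on recurrent neural networks

NLP applications mainly add two layers: a word embedding layer and a softmax layer.

Word embedding layer

Word embeddings can be understood as embedding the vocabulary into a real-valued space of fixed dimension. Converting words to word vectors has two main benefits:

It reduces the dimensionality of the input.
It adds semantic information.

Suppose the embedding dimension is EMB_SIZE and the vocabulary size is VOCAB_SIZE. Then the word vectors of all words fit in a matrix of size VOCAB_SIZE x EMB_SIZE.
Word vectors are read with tf.nn.embedding_lookup.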

embedding = tf.get_variable('embedding', [VOCAB_SIZE, EMB_SIZE])
# The output has one more dimension than the input, of size EMB_SIZE. In a language
# model, input_data has shape batch_size x num_steps, and the output input_embedding
# has shape batch_size x num_steps x EMB_SIZE
input_embedding = tf.nn.embedding_lookup(embedding, input_data)
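Conceptually, an embedding lookup is just row indexing into the embedding matrix. The sketch below illustrates this with nested Python lists; the 4-word vocabulary, the 3-dimensional vectors and the helper embedding_lookup are made-up stand-ins, not the TensorFlow implementation.

```python
# Hypothetical embedding matrix: VOCAB_SIZE = 4 rows, EMB_SIZE = 3 columns.
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

# Word IDs with shape batch_size x num_steps = 2 x 2.
input_data = [[2, 0], [1, 3]]


def embedding_lookup(table, ids):
    # Replace every ID by the corresponding row of the table.
    return [[list(table[i]) for i in row] for row in ids]


# Resulting shape: batch_size x num_steps x EMB_SIZE = 2 x 2 x 3.
input_embedding = embedding_lookup(embedding, input_data)
```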

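Softmax layer

The softmax layer turns the output of the recurrent neural network into an output probability for every word in the vocabulary.

This takes two steps:

First, a linear map projects the RNN output to a vector whose dimension equals the vocabulary size. The output of this step is called the logits.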

# HIDDEN_SIZE is the dimension of the RNN hidden state, VOCAB_SIZE is the vocabulary size
weight = tf.get_variable('weight', [HIDDEN_SIZE, VOCAB_SIZE])
bias = tf.get_variable('bias', [VOCAB_SIZE])
# Compute the linear map
logits = tf.nn.bias_add(tf.matmul(output, weight), bias)

Second, softmax is applied to turn the logits into probabilities that sum to 1.

probs = tf.nn.softmax(logits)

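Training usually does not care about the actual probability values but about the final log perplexity, so
tf.nn.sparse_softmax_cross_entropy_with_logits can be called to compute the log perplexity directly from the logits and use it as the loss function: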

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.reshape(self.targets, [-1]),logits=logits)

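Reducing the number of parameters by sharing them

The number of parameters in both the softmax layer and the embedding layer is proportional to the vocabulary size VOCAB_SIZE. Because VOCAB_SIZE is usually large while HIDDEN_SIZE is comparatively small, the softmax and embedding layers account for a large share of the network's parameters.

Research has shown that sharing the parameters of the embedding layer and the softmax layer not only reduces the number of parameters substantially but can also improve the final model. The complete code example below implements this method.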
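A rough back-of-the-envelope count shows how much sharing saves. The sketch assumes the sizes used in the complete program (vocabulary 10000, hidden size 300, two LSTM layers), an embedding dimension equal to the hidden size, and the standard LSTM layout of four gates with one bias each; the exact numbers depend on these assumptions.

```python
VOCAB_SIZE = 10000
HIDDEN_SIZE = 300
NUM_LAYERS = 2

embedding_params = VOCAB_SIZE * HIDDEN_SIZE
softmax_weight_params = HIDDEN_SIZE * VOCAB_SIZE
softmax_bias_params = VOCAB_SIZE

# One LSTM layer: 4 gates, each with a weight matrix over the concatenated
# [input; hidden] vector plus a bias vector. The input size equals HIDDEN_SIZE here.
lstm_params_per_layer = 4 * ((HIDDEN_SIZE + HIDDEN_SIZE) * HIDDEN_SIZE + HIDDEN_SIZE)
lstm_params = NUM_LAYERS * lstm_params_per_layer

total_unshared = (embedding_params + softmax_weight_params
                  + softmax_bias_params + lstm_params)
# With sharing, the softmax weight is the transposed embedding matrix,
# so its parameters are not counted a second time.
total_shared = total_unshared - softmax_weight_params

saving_ratio = 1.0 - total_shared / total_unshared
```

Under these assumptions the model drops from about 7.45 million to about 4.45 million parameters, a reduction of roughly 40%.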

The complete training program

import numpy as np
import tensorflow as tf
# Paths of the files that have been converted to word IDs
TRAIN_DATA = 'datasets/ptb/ptb.train'
EVAL_DATA = 'datasets/ptb/ptb.valid'
TEST_DATA = 'datasets/ptb/ptb.test'

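HIDDEN_SIZE = 300  # hidden layer size
NUM_LAYERS = 2  # number of LSTM layers in the deep RNN
VOCAB_SIZE = 10000  # vocabulary size
TRAIN_BATCH_SIZE = 20  # batch size for training data
TRAIN_NUM_STEP = 35  # truncation length for training data

EVAL_BATCH_SIZE = 1  # batch size for evaluation data
EVAL_NUM_STEP = 1  # truncation length for evaluation data
NUM_EPOCH = 5  # number of passes over the training data
LSTM_KEEP_PROB = 0.9  # probability that an LSTM node is kept (not dropped out)
EMBEDDING_KEEP_PROB = 0.9  # probability that an embedding element is kept (not dropped out)
MAX_GRAD_NORM = 5  # upper bound on the gradient norm, to control exploding gradients
SHARE_EMB_AND_SOFTMAX = True  # share parameters between the softmax layer and the embedding layer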


class PTBModel:
	def __init__(self, is_trainning, batch_size, num_steps):
		# Record the batch size and truncation length in use
		self.batch_size = batch_size
		self.num_steps = num_steps

		# Inputs and expected outputs for each step, both of shape [batch_size, num_steps]
		self.input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
		self.targets = tf.placeholder(tf.int32, [batch_size, num_steps])

		# A deep RNN with LSTM cells; dropout is only active during training
		dropout_keep_prob = LSTM_KEEP_PROB if is_trainning else 1.0
		# Stacked LSTM cells with dropout on their outputs
		lstm_cells = [
			tf.nn.rnn_cell.DropoutWrapper(
				# one LSTM cell, wrapped so that dropout is applied to its output
				tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE),
				output_keep_prob=dropout_keep_prob) for _ in range(NUM_LAYERS)
		]
		stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(lstm_cells)

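		# All-zero initial state: one (c, h) pair per layer, each of shape [batch_size, HIDDEN_SIZE], e.g. (20, 300)
		self.initial_state = stacked_lstm.zero_state(batch_size, tf.float32)
		# Embedding matrix [VOCAB_SIZE, HIDDEN_SIZE]; the embedding dimension equals HIDDEN_SIZE so that it can be shared with the softmax layer
		embedding = tf.get_variable('embedding', [VOCAB_SIZE, HIDDEN_SIZE])

		# Convert the input words to word vectors of shape [batch_size, num_steps, HIDDEN_SIZE]
		inputs = tf.nn.embedding_lookup(embedding, self.input_data)

		# Apply dropout only during training
		if is_trainning:
			inputs = tf.nn.dropout(inputs, EMBEDDING_KEEP_PROB)

		# List that collects the output of every time step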
		outputs = []
		state = self.initial_state
		with tf.variable_scope('RNN'):
			for time_step in range(num_steps):
				if time_step > 0:
					tf.get_variable_scope().reuse_variables()
				# Feed one time step at a time together with the state of the previous step;
				# the call returns a tuple (output, new state)
				# cell_output: [batch_size, HIDDEN_SIZE]
				# state: per-layer LSTM states, each of shape [batch_size, HIDDEN_SIZE]
				cell_output, state = stacked_lstm(inputs[:, time_step, :], state)
				# Append to the output list
				outputs.append(cell_output)
		# outputs is a list of num_steps tensors of shape [batch_size, HIDDEN_SIZE].
		# tf.concat(outputs, 1) joins them along dimension 1 into [batch_size, HIDDEN_SIZE * num_steps],
		# e.g. 35 tensors of 20x300 become (20, 300x35); the reshape then gives
		# [batch_size * num_steps, HIDDEN_SIZE]
		output = tf.reshape(tf.concat(outputs, 1), [-1, HIDDEN_SIZE])

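		# Optionally share parameters between the embedding and softmax layers
		if SHARE_EMB_AND_SOFTMAX:
			weight = tf.transpose(embedding)  # the transposed embedding matrix serves as the softmax weight
		else:
			weight = tf.get_variable('weight', [HIDDEN_SIZE, VOCAB_SIZE])
		# Fully connected layer; weight is its weight matrix
		bias = tf.get_variable('bias', [VOCAB_SIZE])
		logits = tf.matmul(output, weight) + bias # [batch_size * num_steps, VOCAB_SIZE]

		# Cross-entropy loss and average loss
		# tf.reshape(self.targets, [-1]) flattens the targets to shape (batch_size * num_steps,)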
		loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.reshape(self.targets, [-1]), logits=logits)
		self.cost = tf.reduce_sum(loss) / batch_size
		self.final_state = state

		# Define the backpropagation ops only when training
		if not is_trainning:
			return

		trainable_variables = tf.trainable_variables()
		# Clip the gradients, then define the optimizer and the training step
		grads, _ = tf.clip_by_global_norm(
			tf.gradients(self.cost, trainable_variables), MAX_GRAD_NORM  # tf.gradients computes the gradients w.r.t. trainable_variables
		)
		optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
		self.train_op = optimizer.apply_gradients(zip(grads, trainable_variables))  # apply the gradients; the argument is a list of (gradient, variable) pairs


# Run train_op with the given model on the data and return the perplexity over all of it
def run_epoch(session, model, batches, train_op, output_log, step):
	total_costs = 0.0
	iters = 0
	state = session.run(model.initial_state)
	# Train for one epoch
	for x, y in batches:
		# Run train_op on the current batch and compute the loss
		cost, state, _ = session.run(
			[model.cost, model.final_state, train_op],
			{
				model.input_data: x, model.targets: y, model.initial_state: state
			}
		)
		total_costs += cost
		iters += model.num_steps

		if output_log and step % 100 == 0:
			print('After %d steps, perplexity is %.3f' % (step, np.exp(total_costs / iters)))
		step += 1
	return step, np.exp(total_costs / iters)


# Read data from a file and return a list of word IDs
def read_data(file_path):
	with open(file_path, 'r') as fin:
		# Read the whole document into one long string
		id_string = ' '.join([line.strip() for line in fin.readlines()])
	id_list = [int(w) for w in id_string.split()]  # convert the word IDs to integers
	return id_list


def make_batches(id_list, batch_size, num_step):
	# Number of batches; each batch holds batch_size * num_step words
	num_batches = (len(id_list) - 1) // (batch_size * num_step)

	# Arrange the data as a 2-D array of shape [batch_size, num_batches * num_step]
	data = np.array(id_list[:num_batches * batch_size * num_step])
	data = np.reshape(data, [batch_size, num_batches * num_step])
	# Split along the second dimension into num_batches batches, stored in a list
	data_batches = np.split(data, num_batches, axis=1)

	# Repeat the above with every position shifted one to the right. This gives
	# the next word that the RNN has to predict at each step.
	label = np.array(id_list[1:num_batches * batch_size * num_step + 1])
	label = np.reshape(label, [batch_size, num_batches * num_step])
	label_batches = np.split(label, num_batches, axis=1)

	return list(zip(data_batches, label_batches))


def main():
	# Define the initializer
	initializer = tf.random_uniform_initializer(-0.05, 0.05)
	# Define the RNN model used for training
	with tf.variable_scope('language_model', reuse=None, initializer=initializer):
		train_model = PTBModel(True, TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)

	# Define the RNN model used for evaluation; it shares parameters with train_model but has no dropout
	with tf.variable_scope('language_model', reuse=True, initializer=initializer):
		eval_model = PTBModel(False, EVAL_BATCH_SIZE, EVAL_NUM_STEP)

	# Train the model
	with tf.Session() as sess:
		tf.global_variables_initializer().run()
		train_batches = make_batches(read_data(TRAIN_DATA), TRAIN_BATCH_SIZE, TRAIN_NUM_STEP)
		eval_batches = make_batches(read_data(EVAL_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP)
		test_batches = make_batches(read_data(TEST_DATA), EVAL_BATCH_SIZE, EVAL_NUM_STEP)

		step = 0
		for i in range(NUM_EPOCH):
			print('in iteration: %d' % (i + 1))
			step, train_pplx = run_epoch(sess, train_model, train_batches, train_model.train_op,
			                             True, step)
			print('Epoch: %d Train Perplexity: %.3f' % (i + 1, train_pplx))

			_, eval_pplx = run_epoch(sess, eval_model, eval_batches, tf.no_op(), False, 0)

			print('Epoch: %d Eval Perplexity: %.3f' % (i + 1, eval_pplx))

		_, test_pplx = run_epoch(sess, eval_model, test_batches, tf.no_op(), False, 0)
		print('Test Perplexity: %.3f' % test_pplx)


if __name__ == '__main__':
	main()

Output:

in iteration: 1
After 0 steps, perplexity is 10038.498
After 100 steps, perplexity is 1654.772
After 200 steps, perplexity is 1140.188
After 300 steps, perplexity is 901.156
After 400 steps, perplexity is 742.337
After 500 steps, perplexity is 634.340
After 600 steps, perplexity is 562.385
After 700 steps, perplexity is 506.856
After 800 steps, perplexity is 457.397
After 900 steps, perplexity is 421.076
After 1000 steps, perplexity is 394.789
After 1100 steps, perplexity is 368.277
After 1200 steps, perplexity is 347.685
After 1300 steps, perplexity is 327.851
Epoch: 1 Train Perplexity: 324.737
Epoch: 1 Eval Perplexity: 184.595
...
in iteration: 5
After 5400 steps, perplexity is 74.065
After 5500 steps, perplexity is 75.321
After 5600 steps, perplexity is 78.474
After 5700 steps, perplexity is 76.364
After 5800 steps, perplexity is 75.256
After 5900 steps, perplexity is 75.555
After 6000 steps, perplexity is 75.668
After 6100 steps, perplexity is 74.440
After 6200 steps, perplexity is 74.167
After 6300 steps, perplexity is 74.617
After 6400 steps, perplexity is 73.967
After 6500 steps, perplexity is 73.245
After 6600 steps, perplexity is 72.348
Epoch: 5 Train Perplexity: 72.527
Epoch: 5 Eval Perplexity: 107.859
Test Perplexity: 104.299

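The output shows that at the start of training the perplexity is 10038.498, which is roughly equivalent to picking the next word at random from 10000 words. At the end of training, the perplexity on the training data has dropped to 72.527, which means training has narrowed the choice of the next word from 10000 candidates down to about 73.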
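The link between the initial value and the vocabulary size can be checked with a short simulation of how run_epoch accumulates the loss. The sketch assumes an untrained model that predicts a perfectly uniform distribution, so every word costs log(VOCAB_SIZE); the helper batch_cost_uniform is hypothetical and only mimics the normalization reduce_sum(loss) / batch_size used in the model.

```python
import math

VOCAB_SIZE = 10000
BATCH_SIZE = 20
NUM_STEP = 35


def batch_cost_uniform(batch_size, num_step, vocab_size):
    # Cross-entropy of a uniform prediction is log(vocab_size) for every word.
    per_word_loss = math.log(vocab_size)
    # Mirrors the model: sum of all per-word losses divided by batch_size.
    return per_word_loss * batch_size * num_step / batch_size


total_costs = 0.0
iters = 0
for _ in range(3):  # three simulated batches
    total_costs += batch_cost_uniform(BATCH_SIZE, NUM_STEP, VOCAB_SIZE)
    iters += NUM_STEP  # run_epoch counts num_steps per batch

# total_costs / iters is the average per-word loss, so exp() gives perplexity.
initial_perplexity = math.exp(total_costs / iters)
```

Because the cost is already divided by the batch size, dividing the running total by the number of time steps yields the average loss per word, and a uniform predictor lands at a perplexity of about 10000. The measured 10038.498 is slightly higher because randomly initialized weights are not exactly uniform.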

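Neural machine translation

Background on machine translation and the Seq2Seq model

The basic idea of the Seq2Seq model is very simple: one recurrent neural network reads the input sentence and compresses the information of the whole sentence into an encoding of fixed dimension; another recurrent neural network then reads this encoding and decompresses it into a sentence in the target language. The two networks are called the encoder and the decoder, and the model is also known as the encoder-decoder model.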


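The structure of the decoder is almost identical to that of a language model: the input is the word vector of each word, the output is the word probability produced by the softmax layer, and the loss function is log perplexity.
In fact, the decoder can be understood as a language model conditioned on the input encoding. Techniques used in language models, such as sharing the parameters of the softmax layer and the embedding layer, carry over directly to the decoder of a Seq2Seq model.

The encoder is simpler still. Like the decoder it has an embedding layer and a recurrent neural network, but because nothing is output during the encoding phase, it needs no softmax layer.
During training, the encoder reads the word vector of each word in order and then copies its final hidden state to the decoder as the initial state.
The first input of the decoder is the special <sos> token, the words to be predicted at each step are those of the target sentence in the training data, and the last word of the predicted sequence is the <eos> token, the same as in the language model.

At test time, the decoder generates a translation on its own, without access to the correct answer, and the quality of the translation is then assessed manually or automatically. The process of letting the decoder generate a sentence is also called decoding. At each decoding step, the word with the highest predicted probability is chosen as the output of that step (this is greedy search; beam search is an alternative) and is copied to the input of the next step.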
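The greedy procedure can be sketched without any neural network by replacing the decoder with a lookup table of next-word scores. The table TOY_SCORES below is entirely made up and conditions only on the previous word, whereas a real decoder also conditions on the encoder state and the full history.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical scores for the next word given the previous word.
TOY_SCORES = {
    "<sos>": {"this": 2.0, "a": 0.5, "<eos>": -1.0},
    "this": {"is": 1.5, "test": 0.2, "<eos>": 0.0},
    "is": {"a": 1.0, "<eos>": 0.3},
    "a": {"test": 2.2, "<eos>": 0.1},
    "test": {"<eos>": 3.0, "a": 0.1},
}


def greedy_decode(scores, max_len=10):
    """Pick the highest-scoring word at every step and feed it back as input."""
    output = [SOS]
    # Stop when <eos> is produced or the length limit is reached.
    while output[-1] != EOS and len(output) < max_len:
        candidates = scores[output[-1]]
        output.append(max(candidates, key=candidates.get))
    return output


decoded = greedy_decode(TOY_SCORES)
truncated = greedy_decode(TOY_SCORES, max_len=3)
```

The length limit matters in practice because nothing guarantees that a model will ever emit <eos>.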

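Preprocessing text data for machine translation

The dataset used here is the fairly small IWSLT TED talks dataset. The English-Chinese data is used as the example below; its training set contains about 210,000 sentence pairs.

The preprocessing steps are essentially the same as for the PTB data described above: first count the words that occur in the corpus, assign an ID to each word and store the vocabulary in a vocab file, then convert the text to a representation in word IDs.

The difference is that the downloaded text has not been preprocessed; in particular, it has not been tokenized. Because English words and punctuation marks are attached to each other, the words cannot simply be split on spaces as with the PTB data. A separate tool is needed for tokenization.

The most commonly used open-source tokenizer is moses.

It is used as follows: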

perl ./moses_tokenizer.perl -no-escape -l en < ./train.raw.en > train.txt.en

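The text before tokenization:
And we knew it was volcanic back in the '60s, '70s.
The text after tokenization. Note the spaces added between the apostrophes and the digits and before the comma and the period:
And we knew it was volcanic back in the ' 60s , ' 70s .

For Chinese text, for convenience, the examples in the book simply split the text into individual characters.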

sed 's/ //g; s/\B/ /g' ./train.raw.zh > train.txt.zh

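The text before tokenization:
六七十年代时我们只知道这是一座火山。
The text after tokenization, with a space added between every character and symbol:
六 七 十 年 代 时 我 们 只 知 道 这 是 一 座 火 山 。

After tokenization, the same method as for the PTB data is used to generate the English and Chinese vocabulary files and to convert the text to word IDs.

In the examples below, the English vocabulary size is assumed to be 10000 and the Chinese vocabulary size 4000.
In machine translation training data, each sentence pair is usually trained on as an independent sample. Because the sentences differ in length, shorter sentences have to be padded to the length of the longest sentence in the same batch when they are put into one batch. The positions filled in to extend the length are called padding.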

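In TensorFlow, the padded_batch function of tf.data.Dataset provides this functionality.
The layout below gives a padding example. Suppose a dataset has 4 sentences, "A_1 A_2 A_3 A_4", "B_1 B_2", "C_1 C_2 C_3 C_4 C_5 C_6 C_7" and "D_1".
After adding the necessary padding (written as 0) and grouping them into batches of size 2, the batches are:

Batch 1 (2 x 4):
A_1 A_2 A_3 A_4
B_1 B_2 0   0

Batch 2 (2 x 7):
C_1 C_2 C_3 C_4 C_5 C_6 C_7
D_1 0   0   0   0   0   0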

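When reading data, an RNN includes the padded positions in its computation just like any other content. To keep padding from affecting training, two things need attention.

First, the RNN should skip the computation at padded positions. Take the encoder as an example: if it processed padding like normal input, the final hidden state after reading "B_1 B_2 0 0" would differ from the hidden state after reading "B_1 B_2", which would produce wrong results.

TensorFlow provides tf.nn.dynamic_rnn for this purpose. For each batch, dynamic_rnn takes two inputs: the content of the input data (of shape [batch_size, time], plus the embedding dimension once the words have been looked up) and the lengths of the input data (of shape [batch_size]). For each sequence in the batch, once the corresponding number of positions has been read, dynamic_rnn skips the remaining inputs and copies the result of the last valid step to the later time steps. This ensures that the presence of padding does not affect the model.

It is also worth noting that with dynamic_rnn the maximum sequence length need not be the same for every batch. In the example above, the first batch has shape $2 \times 4$ and the second $2 \times 7$. During training, dynamic_rnn unrolls dynamically to the number of time steps required by the longest sequence in each batch, which is why it is called dynamic.

Second, when designing the loss function, the weight of the loss at padded positions must be explicitly set to 0, so that predictions made at padded positions do not affect the gradient computation.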
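Both points can be illustrated with a few lines of plain Python: padding a batch to a common length, building a 0/1 mask from the true lengths, and averaging the loss over valid positions only. The helper names and the per-position loss values are hypothetical; in TensorFlow the mask would come from tf.sequence_mask.

```python
def pad_batch(sentences, pad_id=0):
    """Pad every sentence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    lengths = [len(s) for s in sentences]
    return padded, lengths


def sequence_mask(lengths, max_len):
    # 1.0 for real tokens, 0.0 for padded positions.
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def masked_mean(losses, mask):
    # Weighted sum of the losses divided by the number of valid positions.
    total = sum(l * w for loss_row, mask_row in zip(losses, mask)
                for l, w in zip(loss_row, mask_row))
    count = sum(w for mask_row in mask for w in mask_row)
    return total / count


batch = [[11, 12, 13, 14], [21, 22]]
padded, lengths = pad_batch(batch)
mask = sequence_mask(lengths, len(padded[0]))

# Made-up per-position losses; the large values sit on padded positions.
losses = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 100.0, 100.0]]
avg_loss = masked_mean(losses, mask)
```

The two losses of 100.0 on the padded positions have no effect on the average, which is exactly the behaviour the masked loss is meant to provide.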

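The code below uses tf.data.Dataset.padded_batch for padding and batching and records the length of each sentence for use as input to dynamic_rnn.
The data is not read into memory all at once; instead, Dataset reads it from disk on the fly.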

# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

MAX_LEN = 50  # maximum number of words allowed in a sentence
SOS_ID = 1    # ID of <sos> in the target vocabulary
EOS_ID = 2    # ID of <eos> in the vocabulary


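# Use Dataset to read the data of one language from a file.
# The file has one sentence per line, with words already converted to word IDs.
def makeDataset(file_path):
	dataset = tf.data.TextLineDataset(file_path)
	# Split the word IDs on whitespace and put them in a 1-D vector
	dataset = dataset.map(lambda string: tf.string_split([string]).values)
	# Convert the word IDs from strings to integers
	dataset = dataset.map(lambda string: tf.string_to_number(string, tf.int32))
	# Count the words in each sentence and store the count in the Dataset together with the sentence
	dataset = dataset.map(lambda x: (x, tf.size(x)))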
	return dataset


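# Read the data from the source-language file src_path and the target-language file trg_path,
# then pad and batch it
def makeSrcTrgDataset(src_path, trg_path, batch_size):
	# First read the source-language and target-language data separately
	src_data = makeDataset(src_path)
	trg_data = makeDataset(trg_path)
	# Combine the two Datasets into one with zip.
	# Each element ds of the Dataset now consists of 4 tensors:
	# ds[0][0] is the source sentence
	# ds[0][1] is the source sentence length
	# ds[1][0] is the target sentence
	# ds[1][1] is the target sentence length
	dataset = tf.data.Dataset.zip((src_data, trg_data))

	# Drop empty sentences (containing only <eos>) and sentences that are too long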
	def filterLength(src_tuple, trg_tuple):
		((src_input, src_len), (trg_label, trg_len)) = (src_tuple, trg_tuple)
		src_len_ok = tf.logical_and(tf.greater(src_len, 1), tf.less_equal(src_len, MAX_LEN))
		trg_len_ok = tf.logical_and(tf.greater(trg_len, 1), tf.less_equal(trg_len, MAX_LEN))
		return tf.logical_and(src_len_ok, trg_len_ok)

	dataset = dataset.filter(filterLength)

	# The decoder needs the target sentence in two formats:
	#   1. the decoder input (trg_input), of the form "<sos> X Y Z"
	#   2. the decoder target output (trg_label), of the form "X Y Z <eos>"
	#   The target sentence read from the file has the form "X Y Z <eos>"; the "<sos> X Y Z"
	#   form is generated from it and added to the Dataset
	def makeTrgInput(src_tuple, trg_tuple):
		((src_input, src_len), (trg_label, trg_len)) = (src_tuple, trg_tuple)
		trg_input = tf.concat([[SOS_ID], trg_label[:-1]], axis=0) # trg_label[:-1] drops <eos>; <sos> is prepended
		return ((src_input, src_len), (trg_input, trg_label, trg_len))

	dataset = dataset.map(makeTrgInput)

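	# Shuffle the training data
	dataset = dataset.shuffle(10000)

	# Shapes of the data after padding
	padded_shapes = (
		(tf.TensorShape([None]), # source sentence: vector of unknown length
		 tf.TensorShape([])), # source length: a scalar
		(tf.TensorShape([None]), # decoder input: vector of unknown length
		 tf.TensorShape([None]), # decoder target output: vector of unknown length
		 tf.TensorShape([]))  # target length: a scalar
	)
	# Batch with padded_batch, batch_size elements per batch; components with shape
	# tf.TensorShape([None]) are padded with 0 when their lengths differ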
	batched_dataset = dataset.padded_batch(batch_size, padded_shapes)
	return batched_dataset

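Implementing the Seq2Seq model in code

This section implements a complete Seq2Seq model. Training and testing are implemented as two separate programs.

First, the training program. It uses a two-layer LSTM as the body of the recurrent neural network and shares parameters between the softmax layer and the embedding layer. Compared with the language model implemented above, the main changes in the code below are:

A recurrent neural network is added as the encoder.
Dataset is used to read the data on the fly.
Every batch is fully independent, so no state needs to be passed between batches.
The model parameters are saved to a checkpoint every 200 training steps.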

# -*- coding: utf-8 -*-
import tensorflow as tf
from data_util import makeDataset, makeSrcTrgDataset

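# The input data has already been converted to word IDs
SRC_TRAIN_DATA = './datasets/ted/train.en'  # source-language input file
TRG_TRAIN_DATA = './datasets/ted/train.zh'  # target-language input file
CHECKPOINT_PAHT = './models/seq2seq_ckpt'  # path where checkpoints are saved
HIDDEN_SIZE = 1024  # LSTM hidden layer size
NUM_LAYERS = 2  # number of LSTM layers in the deep RNN
SRC_VOCAB_SIZE = 10000  # source vocabulary size
TRG_VOCAB_SIZE = 4000  # target vocabulary size
BATCH_SIZE = 100  # batch size for training data
NUM_EPOCH = 5  # number of passes over the training data
KEEP_PROB = 0.8  # probability that a node is kept (not dropped out)
MAX_GRAD_NORM = 5  # upper bound on the gradient norm, to control exploding gradients
SHARE_EMB_AND_SOFTMAX = True  # share parameters between the softmax layer and the embedding layer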


class NMTModel:
	def __init__(self):
		# Define the LSTM structures used by the encoder and the decoder
		self.enc_cell = tf.nn.rnn_cell.MultiRNNCell(
			[tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
		)
		self.dec_cell = tf.nn.rnn_cell.MultiRNNCell(
			[tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
		)

		# Define separate embeddings for the source and target languages
		self.src_embedding = tf.get_variable('src_emb', [SRC_VOCAB_SIZE, HIDDEN_SIZE])
		self.trg_embedding = tf.get_variable('trg_emb', [TRG_VOCAB_SIZE, HIDDEN_SIZE])

		# Define the variables of the softmax layer
		if SHARE_EMB_AND_SOFTMAX:
			self.softmax_weight = tf.transpose(self.trg_embedding)
		else:
			self.softmax_weight = tf.get_variable('weight', [HIDDEN_SIZE, TRG_VOCAB_SIZE])

		self.softmax_bias = tf.get_variable('softmax_bias', [TRG_VOCAB_SIZE])

	# The forward function defines the forward computation graph of the model.
	# src_input, src_size, trg_input, trg_label and trg_size are the five tensors produced by makeSrcTrgDataset
	def forward(self, src_input, src_size, trg_input, trg_label, trg_size):
		batch_size = tf.shape(src_input)[0]
		# Convert the input and output word IDs to word vectors
		src_emb = tf.nn.embedding_lookup(self.src_embedding, src_input)
		trg_emb = tf.nn.embedding_lookup(self.trg_embedding, trg_input)

		# Apply dropout to the word vectors
		src_emb = tf.nn.dropout(src_emb, KEEP_PROB)
		trg_emb = tf.nn.dropout(trg_emb, KEEP_PROB)

		# Build the encoder with dynamic_rnn.
		# The encoder reads the word vector at every position of the source sentence and
		# outputs the hidden state of the last step, enc_state.
		# Because the encoder is a two-layer LSTM, enc_state is a tuple of two
		# LSTMStateTuple objects, one per encoder layer.
		# enc_outputs is the output of the top LSTM layer at every step, of shape [batch_size, max_time, HIDDEN_SIZE]
		with tf.variable_scope('encoder'):
			enc_outputs, enc_state = tf.nn.dynamic_rnn(
				self.enc_cell, src_emb, src_size, dtype=tf.float32
			)

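		# Build the decoder with dynamic_rnn.
		# The decoder reads the word vector at every position of the target sentence; dec_outputs
		# is the output of the top LSTM layer at every step, of shape [batch_size, max_time, HIDDEN_SIZE].
		# initial_state=enc_state initializes the hidden state of the first step with the encoder's final state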
		with tf.variable_scope('decoder'):
			dec_outputs, _ = tf.nn.dynamic_rnn(
				self.dec_cell, trg_emb, trg_size, initial_state=enc_state
			)

		# Compute the log perplexity at every decoder step, as in the language model code
		output = tf.reshape(dec_outputs, [-1, HIDDEN_SIZE])
		logits = tf.matmul(output, self.softmax_weight) + self.softmax_bias
		loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
			labels=tf.reshape(trg_label, [-1]), logits=logits
		)

		# When averaging the loss, set the weight of padded positions to 0 so that predictions at invalid positions do not disturb training
		label_weights = tf.sequence_mask(
			trg_size, maxlen=tf.shape(trg_label)[1], dtype=tf.float32
		)
		label_weights = tf.reshape(label_weights, [-1])
		cost = tf.reduce_sum(loss * label_weights)
		cost_per_token = cost / tf.reduce_sum(label_weights)

		# Define the backpropagation ops
		trainable_varibles = tf.trainable_variables()
		# Clip the gradients
		grads = tf.gradients(cost / tf.to_float(batch_size), trainable_varibles)
		grads, _ = tf.clip_by_global_norm(grads, MAX_GRAD_NORM)
		optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
		train_op = optimizer.apply_gradients(
			zip(grads, trainable_varibles)
		)
		return cost_per_token, train_op


# Train the given model for one epoch
def run_epoch(session, cost_op, train_op, saver, step):
	# Repeat the training step until all data in the Dataset has been consumed
	while True:
		try:
			cost, _ = session.run([cost_op, train_op])
			if step % 10 == 0:
				print('After %d steps, per token cost is %.3f' % (step, cost))

			if step % 200 == 0:
				saver.save(session, CHECKPOINT_PAHT, global_step=step)
			step += 1
		except tf.errors.OutOfRangeError:
			break
	return step


def main():
	# Define the initializer
	initializer = tf.random_uniform_initializer(-0.05, 0.05)
	# Define the RNN model used for training
	with tf.variable_scope('nmt_model', reuse=None, initializer=initializer):
		train_model = NMTModel()

	# Define the input data
	data = makeSrcTrgDataset(SRC_TRAIN_DATA, TRG_TRAIN_DATA, BATCH_SIZE)
	it = data.make_initializable_iterator()
	(src, src_size), (trg_input, trg_label, trg_size) = it.get_next()

	# Define the forward computation graph
	cost_op, train_op = train_model.forward(src, src_size, trg_input, trg_label, trg_size)

	# Train the model
	saver = tf.train.Saver()
	step = 0
	with tf.Session() as sess:
		tf.global_variables_initializer().run()
		for i in range(NUM_EPOCH):
			print('In iteration: %d' % (i + 1))
			sess.run(it.initializer)
			step = run_epoch(sess, cost_op, train_op, saver, step)


if __name__ == '__main__':
	main()

Output:

In iteration: 1
After 0 steps, per token cost is 8.294
After 10 steps, per token cost is 7.463
After 20 steps, per token cost is 6.908
After 30 steps, per token cost is 6.786
After 40 steps, per token cost is 6.670
...
After 8940 steps, per token cost is 2.517
After 8950 steps, per token cost is 2.455
After 8960 steps, per token cost is 2.471
After 8970 steps, per token cost is 2.455
After 8980 steps, per token cost is 2.480
After 8990 steps, per token cost is 2.600
After 9000 steps, per token cost is 2.354
After 9010 steps, per token cost is 2.507

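The program above trains the machine translation model and saves the trained model to checkpoints. The following shows how to load the model from a checkpoint and translate a new sentence. Translating a new input sentence is also called decoding or inference.

The decoder is implemented very differently from training. During training the decoder can read the complete target sentence from the training data, so it can simply be unrolled into a feed-forward network with dynamic_rnn. During decoding, the model can only see the input sentence, not the target sentence. The decoder reads the <sos> token at the first step and predicts the first word of the target sentence; this predicted word then has to be copied to the second step as input to predict the second word, and so on until the predicted word is <eos>.

This process requires a loop construct. In TensorFlow, loops are implemented with tf.while_loop, which is used as follows: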

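# cond is a function that decides whether the loop should continue
# loop_body holds the operations run in each iteration and updates the loop state
# init_state is the initial loop state; it may contain several Tensors or TensorArrays
# The return value is the loop state when the loop ends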
final_state = tf.while_loop(cond, loop_body, init_state)

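While the computation graph is being built, tf.while_loop does not actually loop; it creates a computation node that contains the loop logic. During graph construction, the code inside the loop_body function is executed only once.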
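The calling convention (a condition, a body and an initial state that is threaded through every iteration) can be mimicked in plain Python. The function while_loop_py below is a hypothetical eager stand-in that only reproduces the semantics of the loop result; unlike tf.while_loop it really does call the body once per iteration instead of tracing it once into a graph.

```python
def while_loop_py(cond, body, loop_vars):
    """Eager emulation of the cond/body/state contract of a while loop."""
    while cond(*loop_vars):
        # The body receives the current state and returns the next state.
        loop_vars = body(*loop_vars)
    return loop_vars


# Loop state: (running total, step counter). Sums the integers 0..4.
def cond(total, step):
    return step < 5


def body(total, step):
    return total + step, step + 1


final_total, final_step = while_loop_py(cond, body, (0, 0))
```

The important design constraint carried over to the real API is that the body must return a state with the same structure as the one it received, because that state is the only channel between iterations.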

Before translating, dictionaries from word to ID and from ID to word have to be defined:

EOS_ID = 2

def get_vocabulary(path):
    with open(path, 'r',encoding='utf-8') as f:
        vocab = [w.strip() for w in f.readlines()]
    word_2_id = {k:v for k,v in zip(vocab, range(len(vocab)))}
    id_2_word = {k:v for v,k in word_2_id.items()}
    
    return word_2_id, id_2_word
                                    
                                    
        
en_path = './datasets/ted/en.vocab'
zh_path = './datasets/ted/zh.vocab'

en_word_2_id, en_id_2_word = get_vocabulary(en_path)
zh_word_2_id, zh_id_2_word = get_vocabulary(zh_path)

    
def get_raw_txt(id_2_word,ids):
    txt = [id_2_word.get(i) for i in ids]
    return ' '.join(txt)

def get_en_ids(en_word_2_id,sentence):
    ids = sentence.split()
    return [en_word_2_id[w] for w in ids] +[EOS_ID]

The code below then shows how to implement decoding with tf.while_loop:

import tensorflow as tf

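CHECKPOINT_PATH = './models/seq2seq_ckpt-9000'  # checkpoint saved by the training program at step 9000

# Model parameters; they must match the values used for training
HIDDEN_SIZE = 1024  # LSTM hidden layer size
NUM_LAYERS = 2  # number of LSTM layers in the deep RNN
SRC_VOCAB_SIZE = 10000  # source vocabulary size
TRG_VOCAB_SIZE = 4000  # target vocabulary size
SHARE_EMB_AND_SOFTMAX = True  # share parameters between the softmax layer and the embedding layer

# IDs of <sos> and <eos>; during decoding <sos> is the input of the first step
SOS_ID = 1
EOS_ID = 2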

class NMTModel:
    def __init__(self):
        # Same as the __init__ function used for training
        self.enc_cell = tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
        )
        self.dec_cell = tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
        )

        # Define separate embeddings for the source and target languages
        self.src_embedding = tf.get_variable('src_emb', [SRC_VOCAB_SIZE, HIDDEN_SIZE])
        self.trg_embedding = tf.get_variable('trg_emb', [TRG_VOCAB_SIZE, HIDDEN_SIZE])

        # Define the variables of the softmax layer
        if SHARE_EMB_AND_SOFTMAX:
            self.softmax_weight = tf.transpose(self.trg_embedding)
        else:
            self.softmax_weight = tf.get_variable('weight', [HIDDEN_SIZE, TRG_VOCAB_SIZE])

        self.softmax_bias = tf.get_variable('softmax_bias', [TRG_VOCAB_SIZE])
    
    def inference(self, src_input):
        # Although the input is a single sentence, dynamic_rnn expects a batch,
        # so the sentence is arranged as a batch of size 1
        src_size = tf.convert_to_tensor([len(src_input)],dtype=tf.int32)
        src_input = tf.convert_to_tensor([src_input], dtype=tf.int32)
        src_emb = tf.nn.embedding_lookup(self.src_embedding,src_input)

        # Build the encoder with dynamic_rnn, as in training
        with tf.variable_scope('encoder'):
            enc_outputs, enc_state = tf.nn.dynamic_rnn(self.enc_cell,src_emb,src_size,dtype=tf.float32)

        # Maximum number of decoding steps
        MAX_DEC_LEN = 100
        
        with tf.variable_scope('decoder/rnn/multi_rnn_cell'):
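            # Use a variable-length TensorArray to store the generated sentence
            init_array = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True, clear_after_read=False)
            # Write the first word <sos> as the decoder input
            init_array = init_array.write(0, SOS_ID)
            # Build the initial loop state: the hidden state of the RNN, the TensorArray
            # holding the generated sentence, and an integer step counting the decoding steps
            init_loop_var = (enc_state, init_array, 0)

            # Loop condition for tf.while_loop:
            # continue until the decoder outputs <eos> or the maximum number of steps is reached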
            def continue_loop_condition(state, trg_ids, step):
                return tf.reduce_all(tf.logical_and(
                    tf.not_equal(trg_ids.read(step), EOS_ID),
                    tf.less(step, MAX_DEC_LEN-1)
                ))
            
            def loop_body(state, trg_ids, step):
                # Read the word output at the last step and look up its word vector
                trg_input = [trg_ids.read(step)]
                trg_emb = tf.nn.embedding_lookup(self.trg_embedding, trg_input)

                # Call dec_cell directly to advance one step
                dec_outputs, next_state = self.dec_cell.call(
                    state=state, inputs=trg_emb
                )
                # Compute the logit of every possible output word and pick the word with the largest logit as the output of this step
                output = tf.reshape(dec_outputs, [-1, HIDDEN_SIZE])
                logits = (tf.matmul(output, self.softmax_weight) + self.softmax_bias)
                next_id = tf.argmax(logits, axis=1, output_type=tf.int32)
                # Write the word output at this step into trg_ids in the loop state
                trg_ids = trg_ids.write(step+1,next_id[0])
                return next_state, trg_ids, step+1

            # Run tf.while_loop and return the final state
            state, trg_ids, step = tf.while_loop(
                continue_loop_condition, loop_body, init_loop_var
            )
            return trg_ids.stack() # stack the TensorArray entries into a single 1-D tensor

def main():
    # Define the RNN model used for inference
    with tf.variable_scope("nmt_model",reuse=None):
        model = NMTModel()
    # A test example: the word IDs corresponding to "This is a test."
    test_sentence = get_en_ids(en_word_2_id,'This is a test .')
    # test_sentence = [11, 250, 487, 463, 360, 2]
    # Build the computation graph needed for decoding
    output_op = model.inference(test_sentence)
    sess = tf.Session()
    saver = tf.train.Saver()
    saver.restore(sess, CHECKPOINT_PATH)
    # Read the translation result
    output = sess.run(output_op)
    print(get_raw_txt(zh_id_2_word,output))
    sess.close()
    
if __name__ == '__main__':
    main()

Output:

<sos> 这 是 一 个 测 试 。 <eos>

The translation of this sentence is correct.

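Attention mechanism

In the Seq2Seq model, the encoder compresses the complete input sentence into a vector of fixed dimension, and the decoder then generates the output sentence from this vector. When the input sentence is long, this intermediate vector can hardly hold enough information, and it becomes a bottleneck of the model. The attention mechanism was designed to solve this problem.

The attention mechanism lets the decoder look up individual words or fragments of the input sentence at any time, so the intermediate vector no longer has to store all the information.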

[Figure: a Seq2Seq model that uses the attention mechanism]

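The figure above gives an overview of the main framework of the attention mechanism; the next figure shows the details of the attention model.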

[Figure: implementation details of the attention mechanism]

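At every decoding step, the decoder uses its hidden state as a query against the hidden states of the encoder. For each input position it computes a weight that reflects how relevant that position is to the query, and then takes the weighted average of the hidden states of all input positions according to these weights. The resulting vector is called the "context"; it represents the source information most relevant to the word currently being translated. When decoding the next word, the context is fed to the recurrent neural network as an additional input, so that the network can always read the most relevant source information and does not have to rely entirely on the hidden state of the previous step.

The mathematical definition of the attention mechanism follows.

In the figure above, $h_i$ is the output of the encoder at the $i$-th word and $s_j$ is the state of the decoder when predicting the $j$-th word. The context at step $j$ is computed as follows: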
$$\alpha_{i,j} = \frac{\exp\left( e(h_i,s_j)\right)}{ \sum_i \exp \left( e(h_i,s_j) \right)} \\ context_j = \sum_i \alpha_{i,j} h_i$$

其中 e ( h i , s j ) e(h_i,s_j) e(hi,sj)是计算原文各单词与当前解码器状态的相关度的函数。常用的 e ( h , s ) e(h,s) e(h,s)函数定义式一个带有单个隐藏层的前馈神经网络：
e ( h , s ) = U tanh ⁡ ( V h + W s ) e(h,s) = U \tanh (Vh +Ws) e(h,s)=Utanh(Vh+Ws)

where $U$, $V$ and $W$ are model parameters, so $e(h, s)$ is a fully connected network with one hidden layer. Many other designs exist, such as $e(h, s) = h^T W s$ proposed by Minh-Thang Luong et al., or simply the dot product of the two states, $e(h, s) = h^T s$.
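
To make the formulas above concrete, here is a minimal pure-Python sketch (not the book's code) that computes the attention weights and the context vector for a single decoder step. The three scoring functions mirror the three forms of e(h, s) given above. The function names (`score_additive`, `score_bilinear`, `score_dot`, `attention`) and the toy vectors are assumptions made purely for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def score_additive(h, s, U, V, W):
    # e(h, s) = U tanh(V h + W s); U is a vector, V and W are matrices
    hidden = [math.tanh(a + b) for a, b in zip(matvec(V, h), matvec(W, s))]
    return dot(U, hidden)


def score_bilinear(h, s, W):
    # e(h, s) = h^T W s
    return dot(h, matvec(W, s))


def score_dot(h, s):
    # e(h, s) = h^T s
    return dot(h, s)


def attention(enc_outputs, s, score):
    """Return (weights, context) for decoder state s over all encoder outputs."""
    scores = [score(h, s) for h in enc_outputs]
    # Subtracting the maximum keeps exp() numerically stable; the softmax is unchanged
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(enc_outputs[0])
    # The context is the weighted average of the encoder outputs
    context = [sum(w * h[d] for w, h in zip(weights, enc_outputs))
               for d in range(dim)]
    return weights, context


# Toy example: three source positions, 2-dimensional states
enc_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_j = [2.0, 0.0]
weights, context = attention(enc_outputs, s_j, score_dot)
print(weights)
print(context)
```

With the dot-product score, the first and third positions receive the same score and therefore the same weight, while the second position, which is orthogonal to the query, receives the smallest weight.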

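Once the context vector for step $j$ has been computed, it is fed to the recurrent layer as part of the input at step $j+1$. If $h$ has dimension hidden_src and the word embedding has dimension hidden_emb, then the input used to compute the hidden state $s_j$ has dimension hidden_src + hidden_emb. Through the context vector, the decoder can query the most relevant source information at every decoding step, which avoids the bottleneck of the plain Seq2Seq model.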
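
The dimension bookkeeping can be checked with a tiny sketch. The helper name `decoder_step_input` and the toy sizes are hypothetical, and the order of concatenation (embedding first, then context) is an assumption that does not affect the resulting dimension.

```python
def decoder_step_input(word_embedding, prev_context):
    """Build the recurrent-layer input for one decoder step."""
    # Assumed order: word embedding first, then the context of the previous step
    return list(word_embedding) + list(prev_context)


hidden_src = 4  # dimension of each encoder output h_i (toy value)
hidden_emb = 3  # dimension of a word embedding (toy value)

context_prev = [0.1] * hidden_src
embedding = [0.5] * hidden_emb
step_input = decoder_step_input(embedding, context_prev)
print(len(step_input))  # 7, i.e. hidden_src + hidden_emb
```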


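Comparing the two figures above, two differences stand out in addition to the attention mechanism itself.

First, the encoder is a bidirectional recurrent network. A plain Seq2Seq model can also use a bidirectional encoder, but the choice matters much more with attention: when the decoder queries a word through attention, it usually also needs to know something about the words around it.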
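
The point about surrounding words can be illustrated with a toy bidirectional network. This is only a sketch: the "cell" is a hypothetical running sum (`toy_step`), chosen so that the result is easy to follow by hand, and the helper names are not from the original code.

```python
def run_rnn(inputs, step, init):
    """Run a recurrent cell over a sequence and collect the state at every step."""
    state, outputs = init, []
    for x in inputs:
        state = step(state, x)
        outputs.append(state)
    return outputs


def bidirectional(inputs, step, init):
    """Concatenate, per position, the forward and the backward outputs."""
    fw = run_rnn(inputs, step, init)
    # The backward pass reads the sequence in reverse; its outputs are
    # reversed again so that index i refers to the same input position
    bw = run_rnn(inputs[::-1], step, init)[::-1]
    return [f + b for f, b in zip(fw, bw)]  # list concatenation


def toy_step(state, x):
    # Hypothetical cell: the state is the running sum of the inputs seen so far
    return [state[0] + x]


outputs = bidirectional([1.0, 2.0, 3.0], toy_step, [0.0])
print(outputs)  # [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

At every position, the first component summarizes the words to the left (inclusive) and the second summarizes the words to the right (inclusive), which is the information the decoder gets when it attends to that position.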

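Second, the direct connection between the encoder and the decoder is removed, so the decoder relies entirely on attention to obtain information about the source sentence. Without this connection, the encoder and the decoder are free to use different architectures: they can differ in depth, dimension and structure, for example a bidirectional LSTM in the encoder and a unidirectional LSTM in the decoder.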

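TensorFlow already ships several ready-made implementations. tf.contrib.seq2seq.AttentionWrapper combines the decoder's recurrent layers with an attention layer into a single higher-level recurrent cell. The context computed at each step is passed from one decoding step to the next, just like a hidden state passed between adjacent time steps. Once attention is wrapped as a recurrent cell, the new attention-equipped network can be run with dynamic_rnn.

Replace self.enc_cell in the __init__ function from the previous subsection with the following code: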

self.enc_cell_fw = LSTMCell(HIDDEN_SIZE)  # forward-direction cell
self.enc_cell_bw = LSTMCell(HIDDEN_SIZE)  # backward-direction cell

Replace the corresponding part of the forward function with the following code:

        # Build the encoder with bidirectional_dynamic_rnn.
        # The encoder reads the word embedding at every position of the source sentence.
        # enc_state is a tuple (forward state, backward state); each element is an
        # LSTMStateTuple whose tensors have shape [batch_size, HIDDEN_SIZE].
        # A plain Seq2Seq model uses only enc_state; the attention model instead
        # uses enc_outputs, the encoder outputs at every step.
        with tf.variable_scope('encoder'):
            # Build a bidirectional recurrent network with bidirectional_dynamic_rnn.
            # Its output enc_outputs is a tuple of two tensors, each of shape
            # [batch_size, max_time, HIDDEN_SIZE]: the per-step outputs of the two LSTMs.
            enc_outputs, enc_state = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=self.enc_cell_fw,  # forward-direction cell
                cell_bw=self.enc_cell_bw,  # backward-direction cell
                inputs=src_emb,
                sequence_length=src_size,
                dtype=tf.float32
            )
            # Concatenate the outputs of the two LSTMs into one tensor of shape
            # [batch_size, max_time, 2 * HIDDEN_SIZE]
            enc_outputs = tf.concat([enc_outputs[0], enc_outputs[1]], -1)

        with tf.variable_scope('decoder'):
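            # Choose the model used to compute attention weights. BahdanauAttention
            # uses a feed-forward network with one hidden layer.
            # memory_sequence_length is a tensor of shape [batch_size] holding the length
            # of each sentence in the batch; attention uses it to set the weights
            # at padded positions to 0.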
            attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                num_units=HIDDEN_SIZE,
                memory=enc_outputs,
                memory_sequence_length=src_size
            )

            # Wrap the decoder network self.dec_cell and the attention mechanism
            # into a higher-level recurrent cell
            attention_cell = tf.contrib.seq2seq.AttentionWrapper(
                cell=self.dec_cell,
                attention_mechanism=attention_mechanism,
                attention_layer_size=HIDDEN_SIZE
            )
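            # Build the decoder with attention_cell and dynamic_rnn.
            # Note that no initial_state is passed here: the decoder relies entirely
            # on attention as its source of information.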

            dec_outputs, _ = tf.nn.dynamic_rnn(
                cell=attention_cell,
                inputs=trg_emb,
                sequence_length=trg_size,
                dtype=tf.float32
            )

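Attention is an efficient way to access information. On the one hand, it lets the decoder actively query the most relevant information at every step; on the other hand, it greatly shortens the distance information has to travel. With attention, the decoder can consult any word of the input in a single step, at any point in decoding.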
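
The memory_sequence_length argument in the code above exists because the source sentences in a batch are padded to a common length, and padded positions must not receive any attention. The following sketch shows the idea for one sentence. The helper name `masked_attention_weights` is hypothetical, and the code illustrates the masking idea only; it is not TensorFlow's internal implementation.

```python
import math


def masked_attention_weights(scores, length):
    """Softmax over the first `length` scores; padded positions get weight 0.

    scores: one relevance score per (possibly padded) source position
    length: number of real tokens; positions >= length are padding
    """
    valid = scores[:length]
    top = max(valid)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - length)


# A sentence of 2 real tokens padded to length 4; the scores at the
# padded positions are large on purpose to show that they are ignored
weights = masked_attention_weights([1.0, 2.0, 9.0, 9.0], length=2)
print(weights)
```

The weights of the real tokens still sum to 1, so the context vector remains a weighted average of real encoder outputs only.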

The complete code is given below.
First, the training program:

import tensorflow as tf
from tensorflow.contrib.rnn import LSTMCell, MultiRNNCell

MAX_LEN = 50  # maximum number of words allowed in a sentence
SOS_ID = 1  # ID of <sos> in the target vocabulary
EOS_ID = 2  # ID of <eos> in the vocabulary


# Use Dataset to read the data of one language from a file.
# The file has one sentence per line, with words already converted to word IDs.
def makeDataset(file_path):
    dataset = tf.data.TextLineDataset(file_path)
    # Split the word IDs on whitespace and put them into a 1-D vector.
    dataset = dataset.map(lambda string: tf.string_split([string]).values)
    # Convert the word IDs from strings to integers.
    dataset = dataset.map(
        lambda string: tf.string_to_number(string, tf.int32))
    # Count the words in each sentence and store the count in the Dataset together with the sentence.
    dataset = dataset.map(lambda x: (x, tf.size(x)))
    return dataset


# Read data from the source-language file src_path and the target-language file trg_path,
# then pad and batch it.
def makeSrcTrgDataset(src_path, trg_path, batch_size):
    # First read the source-language and target-language data separately.
    src_data = makeDataset(src_path)
    trg_data = makeDataset(trg_path)
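    # Merge the two Datasets into one with zip. Each item ds of the resulting Dataset
    # consists of 4 tensors:
    #   ds[0][0] is the source sentence
    #   ds[0][1] is the source sentence length
    #   ds[1][0] is the target sentence
    #   ds[1][1] is the target sentence length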
    dataset = tf.data.Dataset.zip((src_data, trg_data))

    # Drop empty sentences (containing only <EOS>) and sentences that are too long.
    def filterLength(src_tuple, trg_tuple):
        ((src_input, src_len), (trg_label, trg_len)) = (src_tuple, trg_tuple)
        src_len_ok = tf.logical_and(
            tf.greater(src_len, 1), tf.less_equal(src_len, MAX_LEN))
        trg_len_ok = tf.logical_and(
            tf.greater(trg_len, 1), tf.less_equal(trg_len, MAX_LEN))
        return tf.logical_and(src_len_ok, trg_len_ok)

    dataset = dataset.filter(filterLength)

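    # As Figure 9-5 shows, the decoder needs the target sentence in two formats:
    #   1. the decoder input (trg_input), of the form "<sos> X Y Z"
    #   2. the decoder target output (trg_label), of the form "X Y Z <eos>"
    # The target sentence read from the file has the form "X Y Z <eos>", so we derive
    # the "<sos> X Y Z" form from it and add it to the Dataset.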
    def makeTrgInput(src_tuple, trg_tuple):
        ((src_input, src_len), (trg_label, trg_len)) = (src_tuple, trg_tuple)
        trg_input = tf.concat([[SOS_ID], trg_label[:-1]], axis=0)
        return ((src_input, src_len), (trg_input, trg_label, trg_len))

    dataset = dataset.map(makeTrgInput)

    # Shuffle the training data randomly.
    dataset = dataset.shuffle(10000)

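    # Specify the shapes of the data after padding.
    padded_shapes = (
        (tf.TensorShape([None]),  # source sentence: a vector of unknown length
         tf.TensorShape([])),  # source sentence length: a scalar
        (tf.TensorShape([None]),  # target sentence (decoder input): a vector of unknown length
         tf.TensorShape([None]),  # target sentence (decoder target output): a vector of unknown length
         tf.TensorShape([])))  # target sentence length: a scalar
    # Call padded_batch to batch the data.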
    batched_dataset = dataset.padded_batch(batch_size, padded_shapes)
    return batched_dataset


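# The input data has already been converted to word IDs
SRC_TRAIN_DATA = './datasets/ted/train.en'  # source-language input file
TRG_TRAIN_DATA = './datasets/ted/train.zh'  # target-language input file
CHECKPOINT_PATH = './models/seq2seq_ckpt'  # path where checkpoints are saved
HIDDEN_SIZE = 1024  # size of the LSTM hidden layer
DECODER_LAYERS = 2  # number of LSTM layers in the decoder
SRC_VOCAB_SIZE = 10000  # source-language vocabulary size
TRG_VOCAB_SIZE = 4000  # target-language vocabulary size
BATCH_SIZE = 100  # training batch size
NUM_EPOCH = 5  # number of passes over the training data
KEEP_PROB = 0.8  # probability that a node is kept (not dropped out)
MAX_GRAD_NORM = 5  # upper bound on the gradient norm, used to control exploding gradients
SHARE_EMB_AND_SOFTMAX = True  # share parameters between the softmax layer and the embedding layer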


class NMTModel:
    def __init__(self):
        # Define the LSTM cells used by the encoder and the decoder
        self.enc_cell_fw = LSTMCell(HIDDEN_SIZE)  # forward-direction cell
        self.enc_cell_bw = LSTMCell(HIDDEN_SIZE)  # backward-direction cell

        self.dec_cell = MultiRNNCell(
            [LSTMCell(HIDDEN_SIZE) for _ in range(DECODER_LAYERS)]
        )

        # Define separate word embeddings for the source and target languages
        self.src_embedding = tf.get_variable('src_emb', [SRC_VOCAB_SIZE, HIDDEN_SIZE])
        self.trg_embedding = tf.get_variable('trg_emb', [TRG_VOCAB_SIZE, HIDDEN_SIZE])

        # Define the variables of the softmax layer
        if SHARE_EMB_AND_SOFTMAX:
            self.softmax_weight = tf.transpose(self.trg_embedding)
        else:
            self.softmax_weight = tf.get_variable('weight', [HIDDEN_SIZE, TRG_VOCAB_SIZE])

        self.softmax_bias = tf.get_variable('softmax_bias', [TRG_VOCAB_SIZE])

    # forward defines the forward computation graph of the model.
    # src_input, src_size, trg_input, trg_label and trg_size are the five tensors produced by makeSrcTrgDataset
    def forward(self, src_input, src_size, trg_input, trg_label, trg_size):
        batch_size = tf.shape(src_input)[0]
        # Convert the input and output word IDs to word embeddings
        src_emb = tf.nn.embedding_lookup(self.src_embedding, src_input)
        trg_emb = tf.nn.embedding_lookup(self.trg_embedding, trg_input)

        # Apply dropout to the word embeddings
        src_emb = tf.nn.dropout(src_emb, KEEP_PROB)
        trg_emb = tf.nn.dropout(trg_emb, KEEP_PROB)

        # Build the encoder with bidirectional_dynamic_rnn.
        # The encoder reads the word embedding at every position of the source sentence.
        # enc_state is a tuple (forward state, backward state); each element is an
        # LSTMStateTuple whose tensors have shape [batch_size, HIDDEN_SIZE].
        # A plain Seq2Seq model uses only enc_state; the attention model instead
        # uses enc_outputs, the encoder outputs at every step.
        with tf.variable_scope('encoder'):
            # Build a bidirectional recurrent network with bidirectional_dynamic_rnn.
            # Its output enc_outputs is a tuple of two tensors, each of shape
            # [batch_size, max_time, HIDDEN_SIZE]: the per-step outputs of the two LSTMs.
            enc_outputs, enc_state = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=self.enc_cell_fw,  # forward-direction cell
                cell_bw=self.enc_cell_bw,  # backward-direction cell
                inputs=src_emb,
                sequence_length=src_size,
                dtype=tf.float32
            )
            # Concatenate the outputs of the two LSTMs into one tensor of shape
            # [batch_size, max_time, 2 * HIDDEN_SIZE]
            enc_outputs = tf.concat([enc_outputs[0], enc_outputs[1]], -1)

        with tf.variable_scope('decoder'):
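            # Choose the model used to compute attention weights. BahdanauAttention
            # uses a feed-forward network with one hidden layer.
            # memory_sequence_length is a tensor of shape [batch_size] holding the length
            # of each sentence in the batch; attention uses it to set the weights
            # at padded positions to 0.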
            attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                num_units=HIDDEN_SIZE,
                memory=enc_outputs,
                memory_sequence_length=src_size
            )

            # Wrap the decoder network self.dec_cell and the attention mechanism
            # into a higher-level recurrent cell
            attention_cell = tf.contrib.seq2seq.AttentionWrapper(
                cell=self.dec_cell,
                attention_mechanism=attention_mechanism,
                attention_layer_size=HIDDEN_SIZE
            )
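            # Build the decoder with attention_cell and dynamic_rnn.
            # Note that no initial_state is passed here: the decoder relies entirely
            # on attention as its source of information.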

            dec_outputs, _ = tf.nn.dynamic_rnn(
                cell=attention_cell,
                inputs=trg_emb,
                sequence_length=trg_size,
                dtype=tf.float32
            )

        # Compute the log perplexity at every decoder step, exactly as in the language-model code
        output = tf.reshape(dec_outputs, [-1, HIDDEN_SIZE])
        logits = tf.matmul(output, self.softmax_weight) + self.softmax_bias
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=tf.reshape(trg_label, [-1]), logits=logits
        )

        # When averaging the loss, set the weights of padded positions to 0 so that
        # predictions at invalid positions do not interfere with training
        label_weights = tf.sequence_mask(
            trg_size, maxlen=tf.shape(trg_label)[1], dtype=tf.float32
        )
        label_weights = tf.reshape(label_weights, [-1])
        cost = tf.reduce_sum(loss * label_weights)
        cost_per_token = cost / tf.reduce_sum(label_weights)

        # Define the backpropagation operation
        trainable_variables = tf.trainable_variables()
        # Clip the gradients to control their magnitude
        grads = tf.gradients(cost / tf.to_float(batch_size), trainable_variables)
        grads, _ = tf.clip_by_global_norm(grads, MAX_GRAD_NORM)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = optimizer.apply_gradients(
            zip(grads, trainable_variables)
        )
        return cost_per_token, train_op


# Train the given model for one epoch
def run_epoch(session, cost_op, train_op, saver, step):
    # Repeat the training step until all data in the Dataset has been consumed
    while True:
        try:
            cost, _ = session.run([cost_op, train_op])
            if step % 10 == 0:
                print('After %d steps, per token cost is %.3f' % (step, cost))

            if step % 200 == 0:
                saver.save(session, CHECKPOINT_PATH, global_step=step)
            step += 1
        except tf.errors.OutOfRangeError:
            break
    return step


def main():
    # Define the initializer
    initializer = tf.random_uniform_initializer(-0.05, 0.05)
    # Define the recurrent network used for training
    with tf.variable_scope('nmt_model', reuse=None, initializer=initializer):
        train_model = NMTModel()

    # Define the input data
    data = makeSrcTrgDataset(SRC_TRAIN_DATA, TRG_TRAIN_DATA, BATCH_SIZE)
    it = data.make_initializable_iterator()
    (src, src_size), (trg_input, trg_label, trg_size) = it.get_next()

    # Define the forward computation graph
    cost_op, train_op = train_model.forward(src, src_size, trg_input, trg_label, trg_size)

    # Train the model
    saver = tf.train.Saver()
    step = 0
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        for i in range(NUM_EPOCH):
            print('In iteration: %d' % (i + 1))
            sess.run(it.initializer)
            step = run_epoch(sess, cost_op, train_op, saver, step)


if __name__ == '__main__':
    main()
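
Two steps in the forward function above are easy to get wrong: averaging the loss only over real (non-padded) tokens, and clipping gradients by their global norm. The sketch below reproduces both computations in pure Python on toy numbers. The function names `masked_cost` and `clip_by_global_norm` are chosen for this illustration, and the clipping formula is my understanding of what tf.clip_by_global_norm computes, stated here as an assumption rather than copied from the library.

```python
import math


def masked_cost(losses, lengths):
    """Sum the per-position losses over real tokens and average per token.

    losses:  [batch][max_time] cross-entropy values, including padded positions
    lengths: number of real tokens in each sentence
    """
    total, count = 0.0, 0
    for row, n in zip(losses, lengths):
        for t, loss in enumerate(row):
            if t < n:  # same role as tf.sequence_mask: weight 1 for real tokens
                total += loss
                count += 1
    return total, total / count


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients jointly so that their global norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The scale is 1 when the norm is already below the limit
    scale = max_norm / max(global_norm, max_norm)
    return [[g * scale for g in vec] for vec in grads], global_norm


# Two sentences padded to length 3, with 2 and 1 real tokens respectively
cost, cost_per_token = masked_cost(
    [[1.0, 2.0, 5.0], [3.0, 7.0, 7.0]], lengths=[2, 1])
print(cost, cost_per_token)  # 6.0 2.0

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=2.5)
print(clipped, norm)  # [[1.5], [2.0]] 5.0
```

Because all gradients are multiplied by the same factor, clipping by the global norm changes the length of the update but not its direction.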

Next, the test (inference) program:

import tensorflow as tf
from tensorflow.contrib.rnn import LSTMCell, MultiRNNCell

# IDs of <sos> and <eos>; during decoding, <sos> is the input at the first step
SOS_ID = 1
EOS_ID = 2


def get_vocabulary(path):
    with open(path, 'r', encoding='utf-8') as f:
        vocab = [w.strip() for w in f.readlines()]
    word_2_id = {k: v for k, v in zip(vocab, range(len(vocab)))}
    id_2_word = {k: v for v, k in word_2_id.items()}

    return word_2_id, id_2_word


en_path = './datasets/ted/en.vocab'
zh_path = './datasets/ted/zh.vocab'

en_word_2_id, en_id_2_word = get_vocabulary(en_path)
zh_word_2_id, zh_id_2_word = get_vocabulary(zh_path)


def get_raw_txt(id_2_word, ids):
    txt = [id_2_word.get(i) for i in ids]
    return ' '.join(txt)


def get_en_ids(en_word_2_id, sentence):
    ids = sentence.split()
    return [en_word_2_id[w] for w in ids] + [EOS_ID]


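CHECKPOINT_PATH = './models/seq2seq_ckpt-9000'  # checkpoint saved by the training program at step 9000

# Model parameters; they must match those used in training
HIDDEN_SIZE = 1024  # size of the LSTM hidden layer
NUM_LAYERS = 2  # number of LSTM layers in the decoder (DECODER_LAYERS in the training program)
SRC_VOCAB_SIZE = 10000  # source-language vocabulary size
TRG_VOCAB_SIZE = 4000  # target-language vocabulary size
SHARE_EMB_AND_SOFTMAX = True  # share parameters between the softmax layer and the embedding layer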


class NMTModel:
    def __init__(self):
        # Same as __init__ in the training program
        # Define the LSTM cells used by the encoder and the decoder
        self.enc_cell_fw = LSTMCell(HIDDEN_SIZE)  # forward-direction cell
        self.enc_cell_bw = LSTMCell(HIDDEN_SIZE)  # backward-direction cell
        self.dec_cell = MultiRNNCell(
            [LSTMCell(HIDDEN_SIZE) for _ in range(NUM_LAYERS)]
        )

        # Define separate word embeddings for the source and target languages
        self.src_embedding = tf.get_variable('src_emb', [SRC_VOCAB_SIZE, HIDDEN_SIZE])
        self.trg_embedding = tf.get_variable('trg_emb', [TRG_VOCAB_SIZE, HIDDEN_SIZE])

        # Define the variables of the softmax layer
        if SHARE_EMB_AND_SOFTMAX:
            self.softmax_weight = tf.transpose(self.trg_embedding)
        else:
            self.softmax_weight = tf.get_variable('weight', [HIDDEN_SIZE, TRG_VOCAB_SIZE])

        self.softmax_bias = tf.get_variable('softmax_bias', [TRG_VOCAB_SIZE])

    def inference(self, src_input):
        # Although the input is a single sentence, dynamic_rnn expects a batch,
        # so wrap the sentence into a batch of size 1
        src_size = tf.convert_to_tensor([len(src_input)], dtype=tf.int32)
        src_input = tf.convert_to_tensor([src_input], dtype=tf.int32)
        src_emb = tf.nn.embedding_lookup(self.src_embedding, src_input)

        with tf.variable_scope('encoder'):
            # Build a bidirectional recurrent network with bidirectional_dynamic_rnn.
            # Its output enc_outputs is a tuple of two tensors, each of shape
            # [batch_size, max_time, HIDDEN_SIZE]: the per-step outputs of the two LSTMs.
            enc_outputs, enc_state = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=self.enc_cell_fw,  # forward-direction cell
                cell_bw=self.enc_cell_bw,  # backward-direction cell
                inputs=src_emb,
                sequence_length=src_size,
                dtype=tf.float32
            )
            # Concatenate the outputs of the two LSTMs into one tensor of shape
            # [batch_size, max_time, 2 * HIDDEN_SIZE]
            enc_outputs = tf.concat([enc_outputs[0], enc_outputs[1]], -1)

        with tf.variable_scope('decoder'):
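            # Choose the model used to compute attention weights. BahdanauAttention
            # uses a feed-forward network with one hidden layer.
            # memory_sequence_length is a tensor of shape [batch_size] holding the length
            # of each sentence in the batch; attention uses it to set the weights
            # at padded positions to 0.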
            attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                num_units=HIDDEN_SIZE,
                memory=enc_outputs,
                memory_sequence_length=src_size
            )

            # Wrap the decoder network self.dec_cell and the attention mechanism
            # into a higher-level recurrent cell
            attention_cell = tf.contrib.seq2seq.AttentionWrapper(
                cell=self.dec_cell,
                attention_mechanism=attention_mechanism,
                attention_layer_size=HIDDEN_SIZE
            )

        # Maximum number of decoding steps
        MAX_DEC_LEN = 100

        with tf.variable_scope("decoder/rnn/attention_wrapper"):
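            # Use a dynamically sized TensorArray to store the generated sentence
            init_array = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True, clear_after_read=False)
            # Write the first word <sos> as the initial decoder input
            init_array = init_array.write(0, SOS_ID)
            # Call attention_cell.zero_state to build the initial loop state. The loop state
            # consists of the hidden state of the recurrent network, the TensorArray holding
            # the generated sentence, and an integer step that counts the decoding steps.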
            init_loop_var = (
                attention_cell.zero_state(batch_size=1, dtype=tf.float32),
                init_array, 0)

            # Loop condition for tf.while_loop:
            # continue until the decoder outputs <eos> or the maximum number of steps is reached
            def continue_loop_condition(state, trg_ids, step):
                return tf.reduce_all(tf.logical_and(
                    tf.not_equal(trg_ids.read(step), EOS_ID),  # <eos> has not been output yet
                    tf.less(step, MAX_DEC_LEN - 1)  # maximum number of steps not reached yet
                ))

            def loop_body(state, trg_ids, step):
                # Read the word output at the last step and look up its embedding
                trg_input = [trg_ids.read(step)]
                trg_emb = tf.nn.embedding_lookup(self.trg_embedding, trg_input)

                # Call attention_cell directly to advance one step
                dec_outputs, next_state = attention_cell.call(
                    state=state, inputs=trg_emb
                )
                # Compute the logit of every possible output word and pick the word
                # with the largest logit as this step's output
                output = tf.reshape(dec_outputs, [-1, HIDDEN_SIZE])
                logits = (tf.matmul(output, self.softmax_weight) + self.softmax_bias)
                next_id = tf.argmax(logits, axis=1, output_type=tf.int32)
                # Write this step's output word into trg_ids in the loop state
                trg_ids = trg_ids.write(step + 1, next_id[0])
                return next_state, trg_ids, step + 1  # these values become the arguments of the next iteration

            # Run tf.while_loop and return the final loop state
            state, trg_ids, step = tf.while_loop(
                continue_loop_condition, loop_body, init_loop_var
            )
            return trg_ids.stack()  # stack the TensorArray into a single 1-D tensor


def main():
    # Build the model used for inference
    with tf.variable_scope("nmt_model", reuse=None):
        model = NMTModel()
    # Define a test example: the word IDs for "This is a test ."
    test_sentence = get_en_ids(en_word_2_id, 'This is a test .')
    # test_sentence = [11, 250, 487, 463, 360, 2]
    # Build the computation graph needed for decoding
    output_op = model.inference(test_sentence)
    sess = tf.Session()
    saver = tf.train.Saver()
    saver.restore(sess, CHECKPOINT_PATH)  # load the saved parameters into the model
    # Run the graph to get the translation
    output = sess.run(output_op)
    print(get_raw_txt(zh_id_2_word, output))
    sess.close()


if __name__ == '__main__':
    main()
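
The control flow of the tf.while_loop in inference can be hard to read in graph form. The following plain-Python sketch mirrors the same greedy decoding loop: start from <sos>, repeatedly pick the word with the largest logit, and stop at <eos> or after a maximum number of steps. `greedy_decode` and `toy_step` are hypothetical names; `toy_step` is a scripted stand-in for the attention cell plus softmax layer, not a real model.

```python
SOS_ID = 1
EOS_ID = 2


def greedy_decode(step_fn, init_state, max_len=100):
    """Greedy decoding with the same stopping rule as the while_loop above."""
    ids = [SOS_ID]  # plays the role of the TensorArray trg_ids
    state = init_state
    # len(ids) - 1 corresponds to the loop variable `step`
    while ids[-1] != EOS_ID and len(ids) - 1 < max_len - 1:
        logits, state = step_fn(state, ids[-1])
        # argmax over the vocabulary: index of the largest logit
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids


def toy_step(state, prev_id):
    # Hypothetical decoder step: emits word 5, then word 6, then <eos>.
    # The state is just a step counter; prev_id is ignored by this toy.
    script = [5, 6, EOS_ID]
    target = script[min(state, len(script) - 1)]
    logits = [0.0] * 8
    logits[target] = 1.0
    return logits, state + 1


print(greedy_decode(toy_step, 0))             # [1, 5, 6, 2]
print(greedy_decode(toy_step, 0, max_len=3))  # [1, 5, 6]
```

As in the TensorFlow version, the returned sequence includes the leading <sos>, and the step limit bounds the total length of that sequence.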

Source: https://blog.csdn.net/yjw123456/article/details/113793229
