Python Deep Learning Notes 01: Movie Review Classification

2021-09-09 11:33:03

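I already had some understanding of neural networks, but understanding and practice are not the same thing. To understand them, we have to keep deriving formulas; in practice, we usually just call existing functions. In other words, there is no need to implement forward and backward propagation ourselves: we build the network from library functions, choose an optimizer, a loss function, and metrics, and finally call the fit function with the number of training epochs. That is the lesson I took from the first example in Chapter 3 of the book 《python深度学习》 (Deep Learning with Python).

Note: everything in this post can also be found in my earlier machine learning notes. I am still learning myself, so corrections from more experienced readers are welcome.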

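Problem overview

Movie review classification is a binary classification problem: based on its text, each movie review is classified as either positive or negative.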

What you need

tensorflow: ships with Keras, a library for building and running neural networks. These days PyTorch has taken much of the lead.

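IMDB dataset: a dataset built into Keras. It has already been preprocessed: each review (a sequence of words) has been converted into a sequence of integers, where each integer stands for a particular word in a dictionary. (For example, in the dictionary returned by imdb.get_word_index(), the most frequent word, "the", has index 1.)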

numpy: used to turn the list data into vectors (arrays).

matplotlib: used for plotting.

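Steps

Load the IMDB dataset
Vectorize the training data (one-hot)
Define the network model
Compile the model
Split off a validation set from the training data
Train the model
Inspect the loss and accuracy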

Detailed steps

1. Load the IMDB dataset

Get the imdb object from the Keras library bundled with tensorflow, then load the data from it.

from tensorflow.keras.datasets import imdb
# Keep only the 10,000 most frequent words in the training data
(train_data,train_labels),(test_data,test_labels)=imdb.load_data(num_words=10000)
train_data[0]

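Note on step 1:

Because the data has already been preprocessed into integers, we can convert the integer sequences back into the review text: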

# word_index is a dictionary mapping words to integer indices
word_index = imdb.get_word_index()
# Swap keys and values so that integer indices map back to words
reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])
# Decode the review. Note that the indices are offset by 3, because 0, 1, and 2
# are reserved for "padding", "start of sequence", and "unknown" respectively
decoded_review = ' '.join([reverse_word_index.get(i-3,'?') for i in train_data[0]])
decoded_review

output:

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

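To see why the decoding step subtracts 3, here is a minimal sketch with a tiny made-up word index (the words and indices below are hypothetical, not the real IMDB vocabulary). It assumes the same convention as described above: the values 0, 1, and 2 are reserved, so a word whose index is k appears in the data as k + 3.

```python
# Hypothetical toy word index, standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Reverse mapping: integer index -> word
toy_reverse_index = {index: word for word, index in toy_word_index.items()}


def decode_review(encoded, reverse_index, offset=3):
    """Turn a list of integers back into text; reserved or unknown values become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded)


# 1 is the reserved "start of sequence" value; 4..7 are the four words shifted by 3
encoded_review = [1, 4, 5, 6, 7]
print(decode_review(encoded_review, toy_reverse_index))  # ? the film was brilliant
```

The leading "?" in the decoded IMDB review above has the same origin: the first value of the sequence is the reserved start marker, which has no word attached to it.
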
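2. Vectorize the training data (one-hot)

Here every review (in both the training set and the test set) is one-hot encoded into a 10,000-dimensional vector of 0s and 1s, so that the training samples, and likewise the test samples, can each be stacked into a matrix.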

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    # enumerate() yields (index, item) pairs
    for i, sequence in enumerate(sequences):
        # Set the specified indices of results[i] to 1
        results[i, sequence] = 1.
    return results

# Vectorize the training data
x_train = vectorize_sequences(train_data)
# Vectorize the test data
x_test = vectorize_sequences(test_data)
x_train

output:

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])
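
A small sketch of what this encoding does, using made-up sequences and a dimension of 8 instead of 10000 so the result is easy to read. The function is repeated here only so the snippet runs on its own; the toy sequences are assumptions for illustration.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Fancy indexing: every column listed in `sequence` is set to 1
        results[i, sequence] = 1.
    return results


# Two made-up "reviews"; index 3 is repeated in the first one
toy_sequences = [[1, 3, 3, 5], [0, 7]]
toy_matrix = vectorize_sequences(toy_sequences, dimension=8)
print(toy_matrix)
```

A repeated index sets the same position only once, so each row records which words occur in a review, not how often or in what order.
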

Vectorize the labels (y):

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

3. Define the network model

To build the network we use the models and layers modules from Keras.

from tensorflow.keras import models
from tensorflow.keras import layers

# Create a model
model = models.Sequential()
# First hidden layer: 16 hidden units, relu activation; each input sample is a vector of size 10000
model.add(layers.Dense(16,activation='relu',input_shape=(10000,)))
# Second hidden layer: 16 hidden units, relu activation; its input is the output of the first hidden layer
model.add(layers.Dense(16,activation='relu'))
# Output layer: 1 output unit, sigmoid activation; its input is the output of the second hidden layer
model.add(layers.Dense(1,activation='sigmoid'))
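
To make explicit what the three Dense layers compute, here is a rough NumPy sketch of the forward pass with randomly initialized weights. It only illustrates the layer arithmetic (output = activation(input · W + b)) and is not how Keras implements it; the helper names, the random initialization, and the random inputs are my own assumptions.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
layer_sizes = [10000, 16, 16, 1]  # input, two hidden layers, output

# One (W, b) pair per Dense layer; small random weights, zero biases
params = [(rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def forward(x, params):
    h = x
    # Hidden layers use relu
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    # Output layer uses sigmoid, giving a value between 0 and 1
    W, b = params[-1]
    return sigmoid(h @ W + b)


# Four made-up 0/1 input vectors of size 10000
x = rng.integers(0, 2, size=(4, 10000)).astype("float64")
probs = forward(x, params)
n_params = sum(W.size + b.size for W, b in params)
print(probs.shape, n_params)
```

The parameter count works out to 10000*16 + 16 + 16*16 + 16 + 16*1 + 1 = 160305, almost all of it in the first layer.
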

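4. Compile the model

Set the network's optimizer, loss function, and metrics. Here rmsprop, binary_crossentropy, and accuracy are all predefined in Keras and can be referred to by name. You can also pass in functions you have written yourself, but how to do that is something you will need to look up separately.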

model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
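
For reference, a minimal sketch of what the binary_crossentropy loss and the accuracy metric measure, written in plain NumPy. The clipping constant eps, the function names, and the example predictions are assumptions for this illustration; Keras's own implementation differs in its details.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))


def binary_accuracy(y_true, y_pred, threshold=0.5):
    # A prediction counts as "positive" when it is above the threshold
    return float(np.mean((y_pred > threshold) == (y_true == 1.0)))


# Made-up labels and predicted probabilities
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.4, 0.2])
print(binary_crossentropy(y_true, y_pred), binary_accuracy(y_true, y_pred))
```

The third sample (label 1, prediction 0.4) is the only one misclassified, so the accuracy is 0.75, and it is also the sample that contributes most to the loss.
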

5. Split off a validation set

The first 10,000 training samples become the validation set, leaving 15,000 samples for training.

x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
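
A quick sanity check of the slicing, sketched with placeholder arrays of the same length as the IMDB training set (25,000 samples) rather than the real data. The toy_ names are hypothetical.

```python
import numpy as np

n_samples = 25000
toy_x = np.arange(n_samples)        # stand-in for x_train
toy_y = np.arange(n_samples) % 2    # stand-in for y_train

# Same slicing as above: first 10000 for validation, the rest for training
toy_x_val, toy_partial_x = toy_x[:10000], toy_x[10000:]
toy_y_val, toy_partial_y = toy_y[:10000], toy_y[10000:]

print(len(toy_x_val), len(toy_partial_x))  # 10000 15000
```

The two slices do not overlap, so no validation sample is ever used for a weight update.
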

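6. Train the model

Train the model for 20 epochs in mini-batches of 512 samples, and monitor the loss and accuracy on the 10,000 validation samples after each epoch.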

history =model.fit(partial_x_train,
                   partial_y_train,
                   epochs=20,
                   batch_size=512,
                   validation_data=(x_val,y_val))
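
To get a feel for these settings, the arithmetic below works out how many weight updates they imply. It assumes the final, smaller batch of each epoch is kept rather than dropped, which as far as I know is what fit does for in-memory arrays.

```python
import math

n_train = 15000     # samples left after the validation split
batch_size = 512
epochs = 20

# Number of mini-batches (weight updates) per epoch, counting the partial last batch
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch_size, total_updates)  # 30 152 600
```
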

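7. Inspect the loss and accuracy

model.fit returns a History object; its history attribute is a dictionary containing the values recorded during training.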

history_dict = history.history
history_dict.keys()

output:

dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])

We plot with matplotlib, starting with the training loss and validation loss:

import matplotlib.pyplot as plt
history_dict = history.history
# Training loss
loss_values = history_dict['loss']
# Validation loss
val_loss_values = history_dict['val_loss']
epochs = range(1,len(loss_values)+1)

plt.plot(epochs,loss_values,'bo',label='Training loss')
plt.plot(epochs,val_loss_values,'b',label = 'Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

[Figure: training and validation loss per epoch]

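From this plot we can already tell that the model is overfitting, that is, the training loss keeps improving while the validation loss does not.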
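
One way to read the overfitting point off the recorded history, rather than off the plot, is to look for the epoch with the lowest validation loss. The sketch below uses made-up loss values purely for illustration (they are not results from this model), and best_epoch is a hypothetical helper; with a real run you would pass history.history['val_loss'] instead.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Made-up values for illustration only, NOT measured results
example_val_loss = [0.40, 0.31, 0.28, 0.29, 0.33, 0.38]
print(best_epoch(example_val_loss))  # 3
```

Training past that epoch mostly improves the fit to the training data, so retraining from scratch for roughly that many epochs is a simple way to limit overfitting.
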

Next, look at the training accuracy and validation accuracy:

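# Clear the figure
plt.clf()
# Training accuracy (the key is 'acc' in older Keras versions and 'accuracy' in newer ones)
acc = history_dict['acc'] if 'acc' in history_dict else history_dict['accuracy']
# Validation accuracy (likewise 'val_acc' or 'val_accuracy')
val_acc = history_dict['val_acc'] if 'val_acc' in history_dict else history_dict['val_accuracy']

plt.plot(epochs,acc,'bo',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')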
plt.legend()
plt.show()