
Natural Language Processing (NLP) Starry Sky Intelligent Chatbot Series: Understanding Language with the Transformer Model - Subword Tokenizer

2021-11-23 13:02:53 | Views: 292 | Source: Internet

Tags: NLP, Transformer, datasets, Starry Sky, chatbot, tokenizer, tensorflow



This article walks through an advanced example of translating Portuguese into English.


Installing and Deploying TensorFlow

import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

Running this raises an error:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-4c94d8100fcf> in <module>
----> 1 import tensorflow_datasets as tfds
      2 import tensorflow as tf
      3 
      4 import time
      5 import numpy as np

ModuleNotFoundError: No module named 'tensorflow_datasets'

Install tensorflow_datasets:

(base) C:\Users\admin>activate my_star_space

(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Collecting tensorflow-datasets
  Using cached tensorflow_datasets-4.4.0-py3-none-any.whl (4.0 MB)
Requirement already satisfied: dill in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.3.4)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-1.2.0-py3-none-any.whl (48 kB)
     |████████████████████████████████| 48 kB 21 kB/s
Requirement already satisfied: dataclasses in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.8)
Requirement already satisfied: importlib-resources in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (5.2.2)
Requirement already satisfied: promise in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.3)
Requirement already satisfied: tqdm in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (4.62.2)
Requirement already satisfied: attrs>=18.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (21.2.0)
Requirement already satisfied: requests>=2.19.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (2.26.0)
Requirement already satisfied: six in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.16.0)
Requirement already satisfied: future in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.18.2)
Requirement already satisfied: numpy in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.19.5)
Requirement already satisfied: absl-py in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (0.13.0)
Requirement already satisfied: typing-extensions in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.7.4.3)
Requirement already satisfied: protobuf>=3.12.2 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (3.17.3)
Requirement already satisfied: termcolor in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-datasets) (1.1.0)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2021.5.30)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (1.25.11)
Requirement already satisfied: charset-normalizer~=2.0.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from requests>=2.19.0->tensorflow-datasets) (2.0.4)
Requirement already satisfied: zipp>=3.1.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from importlib-resources->tensorflow-datasets) (3.5.0)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in e:\anaconda3\envs\my_star_space\lib\site-packages (from tensorflow-metadata->tensorflow-datasets) (1.53.0)
Collecting absl-py
  Downloading absl_py-0.12.0-py3-none-any.whl (129 kB)
     |████████████████████████████████| 129 kB 14 kB/s
Requirement already satisfied: colorama in e:\anaconda3\envs\my_star_space\lib\site-packages (from tqdm->tensorflow-datasets) (0.4.4)
Installing collected packages: absl-py, tensorflow-metadata, tensorflow-datasets
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.13.0
    Uninstalling absl-py-0.13.0:
      Successfully uninstalled absl-py-0.13.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.6.0 requires six~=1.15.0, but you have six 1.16.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-datasets-4.4.0 tensorflow-metadata-1.2.0
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the 'e:\anaconda3\envs\my_star_space\python.exe -m pip install --upgrade pip' command.

(my_star_space) C:\Users\admin>pip install tensorflow-datasets
Requirement already satisfied: tensorflow-datasets in e:\anaconda3\envs\my_star_space\lib\site-packages (4.4.0)

Rerunning the command reports tensorflow-datasets 4.4.0 and every one of its dependencies as already satisfied (the repeated "Requirement already satisfied" lines are omitted here), confirming that the installation succeeded.

(my_star_space) C:\Users\admin>
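Once the install finishes, you can confirm that the module now resolves before rerunning the notebook. The following is a small generic sketch (not part of the original tutorial) that uses only the standard library to check importability without actually importing the package:

```python
import importlib.util

def is_installed(name: str) -> bool:
    """Return True if the module `name` can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None

# After `pip install tensorflow-datasets`, this should report True:
print(is_installed("tensorflow_datasets"))
```

If it still prints False, the package was likely installed into a different conda environment than the one the notebook kernel is using.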

Setting Up the Input Pipeline

Use TFDS to load the Portuguese-English translation dataset, which comes from the TED Talks Open Translation Project. The dataset contains approximately 50,000 training examples, 1,100 validation examples, and 2,000 test examples.

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

The download can take quite a long time; the output looks like this:

  Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0...
Dl Completed...: 100%
1/1 [2:57:36<00:00, 10649.11s/ url]
Dl Size...: 100%
124/124 [2:57:36<00:00, 93.26s/ MiB]
Extraction completed...: 100%
1/1 [2:57:36<00:00, 10656.49s/ file]
Dataset ted_hrlr_translate downloaded and prepared to C:\Users\admin\tensorflow_datasets\ted_hrlr_translate\pt_to_en\1.0.0. Subsequent calls will reuse this data.

The downloaded files are saved locally. The contents of dataset_info.json are:

{
  "citation": "@inproceedings{Ye2018WordEmbeddings,\n  author  = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},\n  title   = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},\n  booktitle = {HLT-NAACL},\n  year    = {2018},\n  }",
  "configDescription": "Translation dataset from pt to en in plain text.",
  "configName": "pt_to_en",
  "description": "Data sets derived from TED talk transcripts for comparing similar language pairs\nwhere one is high resource and the other is low resource.",
  "downloadSize": "131005909",
  "fileFormat": "tfrecord",
  "location": {
    "urls": [
      "https://github.com/neulab/word-embeddings-for-nmt"
    ]
  },
  "moduleName": "tensorflow_datasets.translate.ted_hrlr",
  "name": "ted_hrlr_translate",
  "splits": [
    {
      "name": "train",
      "numBytes": "10806586",
      "shardLengths": [
        "51785"
      ]
    },
    {
      "name": "validation",
      "numBytes": "231285",
      "shardLengths": [
        "1193"
      ]
    },
    {
      "name": "test",
      "numBytes": "383883",
      "shardLengths": [
        "1803"
      ]
    }
  ],
  "supervisedKeys": {
    "input": "pt",
    "output": "en"
  },
  "version": "1.0.0"
}
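The split metadata above can be read with the standard json module; summing the shardLengths entries per split reproduces the example counts quoted earlier. A small illustrative sketch, using a trimmed-down copy of only the "splits" fields shown above:

```python
import json

# Trimmed-down copy of the "splits" section from dataset_info.json above
info = json.loads("""
{
  "splits": [
    {"name": "train",      "shardLengths": ["51785"]},
    {"name": "validation", "shardLengths": ["1193"]},
    {"name": "test",       "shardLengths": ["1803"]}
  ]
}
""")

# Each split may be stored in several shards; total examples = sum of shard lengths
counts = {s["name"]: sum(int(n) for n in s["shardLengths"]) for s in info["splits"]}
print(counts)  # {'train': 51785, 'validation': 1193, 'test': 1803}
```

These totals match the roughly 50,000 / 1,100 / 2,000 train, validation, and test examples described above.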

The contents of features.json are:

{
    "type": "tensorflow_datasets.core.features.translation_feature.Translation",
    "content": {
        "languages": [
            "en",
            "pt"
        ]
    }
}

ted_hrlr_translate-test.tfrecord-00000-of-00001 is a binary TFRecord file, matching the "fileFormat": "tfrecord" entry in dataset_info.json above (the screenshot of its contents is not reproduced here).

Creating a Custom Subword Tokenizer from the Training Dataset

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

Running this raises an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c90f5c60daf2> in <module>
----> 1 tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      2     (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
      3 
      4 tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      5     (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'

The API has been moved; use tfds.deprecated.text instead:

tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

sample_string = 'Transformer is awesome.'

tokenized_string = tokenizer_en.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_en.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string

The output is:

Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string: Transformer is awesome.

If a word is not in the vocabulary, the tokenizer encodes the string by breaking the word into subwords.

Let's look at an example:

for ts in tokenized_string:
  print('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))

The output is:

7915 ----> T
1248 ----> ran
7946 ----> s
7194 ----> former 
13 ----> is 
2799 ----> awesome
7877 ----> .
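The decomposition above can be mimicked with a toy greedy longest-match segmenter. This is only an illustration of the idea; the real SubwordTextEncoder learns its subword vocabulary from the corpus and also handles word boundaries, case, and byte-level fallback differently:

```python
def greedy_subword_encode(word, vocab):
    """Greedy longest-match-first segmentation into known subword pieces."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"T", "ran", "s", "former"}
print(greedy_subword_encode("Transformer", vocab))  # ['T', 'ran', 's', 'former']
```

With this toy vocabulary, "Transformer" breaks into the same four pieces the trained tokenizer produced above.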

Add a start and an end token to both the input and the target:

BUFFER_SIZE = 20000
BATCH_SIZE = 64

def encode(lang1, lang2):
  lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
      lang1.numpy()) + [tokenizer_pt.vocab_size+1]

  lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
      lang2.numpy()) + [tokenizer_en.vocab_size+1]
  
  return lang1, lang2
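In plain Python terms, encode() brackets each id sequence with two ids that lie just past the end of the vocabulary, so vocab_size serves as the start token and vocab_size + 1 as the end token. A minimal stand-alone sketch (the value 8087 matches the leading id visible in the English batches later in this article, which is that tokenizer's vocab_size):

```python
def add_special_tokens(ids, vocab_size):
    # start token = vocab_size, end token = vocab_size + 1
    # (both ids lie just outside the learned vocabulary, so they are unambiguous)
    return [vocab_size] + list(ids) + [vocab_size + 1]

print(add_special_tokens([7915, 1248, 7946], 8087))
# [8087, 7915, 1248, 7946, 8088]
```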

To keep the example small and relatively fast, drop samples longer than 40 tokens:

MAX_LENGTH = 40
def filter_max_length(x, y, max_length=MAX_LENGTH):
  return tf.logical_and(tf.size(x) <= max_length,
                        tf.size(y) <= max_length)
def tf_encode(pt, en):
  result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
  result_pt.set_shape([None])
  result_en.set_shape([None])

  return result_pt, result_en
train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# Cache the dataset in memory to speed up reads.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)


val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)
pt_batch, en_batch = next(iter(val_dataset))
pt_batch, en_batch

The output is:

(<tf.Tensor: shape=(64, 38), dtype=int64, numpy=
 array([[8214,  342, 3032, ...,    0,    0,    0],
        [8214,   95,  198, ...,    0,    0,    0],
        [8214, 4479, 7990, ...,    0,    0,    0],
        ...,
        [8214,  584,   12, ...,    0,    0,    0],
        [8214,   59, 1548, ...,    0,    0,    0],
        [8214,  118,   34, ...,    0,    0,    0]], dtype=int64)>,
 <tf.Tensor: shape=(64, 40), dtype=int64, numpy=
 array([[8087,   98,   25, ...,    0,    0,    0],
        [8087,   12,   20, ...,    0,    0,    0],
        [8087,   12, 5453, ...,    0,    0,    0],
        ...,
        [8087,   18, 2059, ...,    0,    0,    0],
        [8087,   16, 1436, ...,    0,    0,    0],
        [8087,   15,   57, ...,    0,    0,    0]], dtype=int64)>)
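The trailing zeros in both tensors come from padded_batch, which pads every sequence in a batch up to the length of the longest one. A pure-Python equivalent of that behavior (illustrative only, with tf.data's default padding value of 0):

```python
def pad_batch(sequences, pad_id=0):
    """Pad all sequences to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in sequences]

print(pad_batch([[8087, 98, 25], [8087, 12]]))
# [[8087, 98, 25], [8087, 12, 0]]
```

This is why the Portuguese batch above has shape (64, 38) and the English batch (64, 40): each batch is padded to its own longest sequence, capped by the MAX_LENGTH filter.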

Appendix: Final Results

Portuguese translated into English (the result screenshot is not reproduced here).

References

https://tensorflow.google.cn/tutorials/text/transformer

Starry Sky Intelligent Chatbot series blog posts

Source: https://blog.csdn.net/duan_zhihua/article/details/121479623
