ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

自然语言处理NLP星空智能对话机器人系列:深入理解Transformer自然语言处理 Summarizing documents with T5-large

2021-10-14 20:31:21  阅读:319  来源: 互联网

标签:NLP Transformer 星空 机器人 对话 自然语言


自然语言处理NLP星空智能对话机器人系列:深入理解Transformer自然语言处理 Summarizing documents with T5-large

目录

Summarizing documents with T5-large

我们将创建一个摘要函数,可以使用任何文本调用文本摘要函数,将以法律和金融方面的例子进行文本摘要,将挑战该方法的局限性。

Creating a summarization function

首先, 我们创建一个名为summary的摘要函数,该函数有两个参数,第一个参数是preprocess_text,即要摘要的文本。第二个参数是ml,是摘要文本的最大长度。

我们将 T5任务前缀prefix“summarize"”应用于输入文本,T5模型有一个统一的结构, 任务是通过前缀prefix+input sequence方法。这看起来很简单,但NLP transformer模型更接近于此普遍训练和zero-shot下游任务


def summarize(text,ml):
  preprocess_text = text.strip().replace("\n","")
  t5_prepared_Text = "summarize: "+preprocess_text
  print ("Preprocessed and prepared text: \n", t5_prepared_Text)

  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)

  # summmarize 
  summary_ids = model.generate(tokenized_text,
                                      num_beams=4,
                                      no_repeat_ngram_size=2,
                                      min_length=30,
                                      max_length=ml,
                                      early_stopping=True)

  output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
  return output

看起来很简单,对吧?从RNN和CNN到Transformers用了35年以上的时间。世界上一些最聪明的研究团队从为特定任务设计的transformers模型到多任务模型,几乎不需要微调。谷歌研究团队创建了一个标准格式的transformer的输入文本,其中包含一个前缀prefix,该前缀指示要解决的NLP问题,这是一个壮举!

A general topic sample

text="""
The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval.  The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%.  Two tape backups were kept plus one on
paper tape.  The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)

运行结果如下:

Number of characters: 534
Preprocessed and prepared text: 
 summarize: The United States Declaration of Independence was the first Etextreleased by Project Gutenberg, early in 1971.  The title was storedin an emailed instruction set which required a tape or diskpack behand mounted for retrieval.  The diskpack was the size of a largecake in a cake carrier, cost $1500, and contained 5 megabytes, ofwhich this file took 1-2%.  Two tape backups were kept plus one onpaper tape.  The 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in 2001.


Summarized text: 
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in

字符数:534

预处理和准备的文本:

概述:《美国独立宣言》是古腾堡计划于1971年初发布的第一份独立宣言。标题存储在一个电子邮件指令集中,该指令集需要一个磁带或磁盘包,并装载以供检索。该磁盘包的大小相当于蛋糕载体中的一块大蛋糕,价格为1500美元,包含5兆字节,其中该文件占1-2%。保留了两个磁带备份和一个纸带备份。我们希望在2001年年底之前在线上拥有的10000个文件将占2001年同等价格硬盘的1-2%。

摘要案文:

《美国独立宣言》是古腾堡计划于1971年初发表的第一份电子文本。我们希望在2001年年底之前在线拥有10000个文件,这将占同等价格硬盘的1-2%

The Bill of Rights sample

#Bill of Rights,V
text ="""
No person shall be held to answer for a capital, or otherwise infamous crime,
unless on a presentment or indictment of a Grand Jury, except in cases arising
in the land or naval forces, or in the Militia, when in actual service
in time of War or public danger; nor shall any person be subject for
the same offense to be twice put in jeopardy of life or limb;
nor shall be compelled in any criminal case to be a witness against himself,
nor be deprived of life, liberty, or property, without due process of law;
nor shall private property be taken for public use without just compensation.

"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
 

运行结果如下:

Number of characters: 591
Preprocessed and prepared text: 
 summarize: No person shall be held to answer for a capital, or otherwise infamous crime,unless on a presentment or indictment of a Grand Jury, except in cases arisingin the land or naval forces, or in the Militia, when in actual servicein time of War or public danger; nor shall any person be subject forthe same offense to be twice put in jeopardy of life or limb;nor shall be compelled in any criminal case to be a witness against himself,nor be deprived of life, liberty, or property, without due process of law;nor shall private property be taken for public use without just compensation.


Summarized text: 
 no person shall be held to answer for a capital, or otherwise infamous crime, unless ona presentment or indictment ofa Grand Jury. nor shall any person be subject for the same offense to be twice put

A corporate law sample

#Montana Corporate Law
#https://corporations.uslegal.com/state-corporation-law/montana-corporation-law/#:~:text=Montana%20Corporation%20Law,carrying%20out%20its%20business%20activities.

text ="""The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose.  In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities.  The corporation can sue and be sued in its corporate name.  It has perpetual succession.  The corporation can buy, sell or otherwise acquire an interest in a real or personal property.  It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country.  It can appoint officers and agents of the corporation for various duties and fix their compensation.
The name of a corporation must contain the word “corporation” or its abbreviation “corp.”  The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state.  It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.
The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing.  The qualifications for directors are fixed either by articles of incorporation or bylaws.  The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation.  The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators.  The shareholders have the power to change the size of board of directors.
"""
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
 
Number of characters: 1816
Preprocessed and prepared text: 
 summarize: The law regarding corporations prescribes that a corporation can be incorporated in the state of Montana to serve any lawful purpose.  In the state of Montana, a corporation has all the powers of a natural person for carrying out its business activities.  The corporation can sue and be sued in its corporate name.  It has perpetual succession.  The corporation can buy, sell or otherwise acquire an interest in a real or personal property.  It can conduct business, carry on operations, and have offices and exercise the powers in a state, territory or district in possession of the U.S., or in a foreign country.  It can appoint officers and agents of the corporation for various duties and fix their compensation.The name of a corporation must contain the word “corporation” or its abbreviation “corp.”  The name of a corporation should not be deceptively similar to the name of another corporation incorporated in the same state.  It should not be deceptively identical to the fictitious name adopted by a foreign corporation having business transactions in the state.The corporation is formed by one or more natural persons by executing and filing articles of incorporation to the secretary of state of filing.  The qualifications for directors are fixed either by articles of incorporation or bylaws.  The names and addresses of the initial directors and purpose of incorporation should be set forth in the articles of incorporation.  The articles of incorporation should contain the corporate name, the number of shares authorized to issue, a brief statement of the character of business carried out by the corporation, the names and addresses of the directors until successors are elected, and name and addresses of incorporators.  The shareholders have the power to change the size of board of directors.


Summarized text: 
 a corporation can be incorporated in the state of Montana to serve any lawful purpose. the corporation has perpetual succession and can sue and be sued in its corporate name. it can conduct business, carry on operations, and have offices

星空智能对话机器人系列博客

标签:NLP,Transformer,星空,机器人,对话,自然语言
来源: https://blog.csdn.net/duan_zhihua/article/details/120770732

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有