
Text Summarization with PyTorch in 5 Simple Steps

Introduction
Text summarization is a natural language processing (NLP) task whose goal is to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text; it also generates new, relevant phrases, which can be seen as a form of paraphrasing. Summarization has found numerous applications across domains, from books and literature to science and R&D, financial research, and legal document analysis.

To date, the most effective approach to abstractive summarization is to fine-tune a transformer model on a summarization dataset. In this article, we will demonstrate how to easily summarize text with such a powerful model in a few simple steps. The model we will use has already been pre-trained, so no additional training is needed :)

Let's get started!

Step 1: Install the Transformers Library
The library we will use is Transformers by Huggingface. If you are not familiar with Transformers, you can go ahead and read my previous article first.

To install Transformers, you can simply run:

 pip install transformers
Note that PyTorch needs to be installed beforehand. If you have not installed PyTorch yet, please visit the official PyTorch website and follow its instructions.
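The exact install command depends on your platform and CUDA version, so use the selector on the PyTorch site; on many systems a plain pip install torch gives a working CPU-only build. Once installed, a quick sanity check (just a minimal sketch, not a required part of the tutorial) confirms the install and GPU visibility:

 import torch

 print(torch.__version__)          # the installed PyTorch version
 print(torch.cuda.is_available())  # True if a CUDA-capable GPU is usable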

Step 2: Import the Library
After successfully installing Transformers, we can now import it into our Python script. We also import os to set the environment variable for the GPU to be used in the next step. Note that this is entirely optional, but if you have multiple GPUs (for example, when working in a Jupyter notebook), it is good practice to prevent accidentally occupying other GPUs.

 from transformers import pipeline
 import os
Step 3: Set the GPU and the Model
If you decide to set a GPU (e.g. 0), you can do it as shown below:

 os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Now we are ready to pick the summarization model to use. Huggingface provides two powerful summarization models to choose from: BART (bart-large-cnn) and T5 (t5-small, t5-base, t5-large, t5-3b, t5-11b). You can learn more about them in their official papers (the BART paper and the T5 paper, referenced at the end of this article).

To use the BART model, which is trained on the CNN/Daily Mail news dataset, you can use the default parameters directly via Huggingface's built-in pipeline module:

 summarizer = pipeline("summarization")
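The checkpoint behind this default can change between Transformers versions, so if you want to be sure you get the BART model described above, you can pin it explicitly by its model ID. A minimal alternative, assuming the facebook/bart-large-cnn checkpoint on the Huggingface hub:

 # Pin the exact checkpoint instead of relying on the pipeline default
 summarizer = pipeline("summarization", model="facebook/bart-large-cnn")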
If you want to use the T5 model instead (e.g. t5-base), which is pre-trained on the C4 Common Crawl web corpus, you can do it like this (we pass framework="pt" so the pipeline runs on the PyTorch backend, matching the PyTorch install from step 1):

 summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="pt")
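Separately from the CUDA_VISIBLE_DEVICES approach in step 3, the pipeline also accepts a device argument that places the model on a specific GPU, or on the CPU with device=-1. A minimal sketch, assuming GPU 0 is available:

 # Alternative to CUDA_VISIBLE_DEVICES: place the pipeline directly on GPU 0
 # (use device=-1 to force CPU instead)
 summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", device=0)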
Step 4: Enter the Text to Summarize
Now that our model is ready, we can start feeding in the text we want to summarize. Imagine we want to summarize the following passage about COVID-19 vaccines from a MedicineNet article:

One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam.

Close to a million doses — over 951,000, to be more exact — made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That’s the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported.

That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported.

We define the variable:

 text = """One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam.
 Close to a million doses -- over 951,000, to be more exact -- made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That's the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported.
 That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported."""
Step 5: Summarize
Finally, we can start summarizing the input text. Here we declare the min_length and max_length we want for the summary output, and we turn off sampling (do_sample=False) so that the summary is deterministic. We can do this by running the following:

 summary_text = summarizer(text, max_length=100, min_length=5, do_sample=False)[0]['summary_text']
 print(summary_text)
We get the summarized text:

Over 951,000 doses of vaccine given in one day in the past 24 hours, CDC says . That’s the largest number of shots given in a month since the rollout began . The federal government gave states the OK to vaccinate anyone over 65 on Tuesday . A number of states have now opened mass vaccination sites in an effort to get more people inoculated, CBS News reports .

As we can see from the summary, the model knows that 24 hours is equivalent to one day, and it cleverly abbreviates the U.S. Centers for Disease Control and Prevention as CDC. Moreover, the model successfully links information across the first and second paragraphs, pointing out that this was the largest number of shots given since the rollout began. Overall, this summarization model performs quite well.
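If you are curious what the pipeline is doing under the hood, the sketch below shows roughly equivalent lower-level calls for the T5 case. This illustrates the general tokenize-generate-decode flow rather than the pipeline's exact internals; note that T5 expects a task prefix such as "summarize: ", which the pipeline normally adds for you:

 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

 tokenizer = AutoTokenizer.from_pretrained("t5-base")
 model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

 # T5 is a text-to-text model, so the task is signalled with a prefix
 inputs = tokenizer("summarize: " + text, return_tensors="pt", truncation=True)
 output_ids = model.generate(**inputs, max_length=100, min_length=5, do_sample=False)
 print(tokenizer.decode(output_ids[0], skip_special_tokens=True))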

Finally, putting it all together, here is the entire code in the form of a Jupyter notebook:

https://gist.github.com/itsuncheng/f3c4dde81ac4651383c4480958da4f8e#file-summarization-ipynb

Lewis, Mike, et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” arXiv preprint arXiv:1910.13461 (2019).

Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” arXiv preprint arXiv:1910.10683 (2019).