
Google Translate API: hands-on with an open-source GitHub project

2021-12-27 15:02:00

Tags: translation, google, text, list, replace, write, API, path, github


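The only Google translation API I have found so far that holds up at scale (translating tens of thousands of sentences or more) is an open-source project on GitHub: https://github.com/Saravananslb/py-googletranslation
I have used it myself to translate a corpus of 40,000+ sentences, and it works. Along the way I found a few small problems that the original source does not handle, so I made some changes and published the new code at: https://github.com/lu-tomato/py-googletranslation

Below is a record of how I used it and the problems I ran into: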

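A Prerequisites

A network connection that can reach Google (VPN)

B Usage

Get the open-source project from the link above and install it locally:

pip install pygoogletranslation

After that it is ready to use!

How I used it:
Create a file translate.py in the project root directory and first run a simple test to check that translation works: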

from pygoogletranslation import Translator


def trans_single():
    text = "a trench is a type of excavation or depression in the ground that is generally deeper than it is wide ( as opposed to a wider gully, or ditch ), and narrow compared with its length ( as opposed to a simple hole ). in geology, trenches are created as a result of erosion by rivers or by geological movement of tectonic plates. in the civil engineering field, trenches are often created to install underground infrastructure or utilities ( such as gas mains, water mains or telephone lines ), or later to access these installations. trenches have also often been dug for military defensive purposes. in archaeology, the trench method is used for searching and excavating ancient ruins or to dig into strata of sedimented material."
    lang = "zh-cn"  # target language code

    # The keyword is spelled `retry_messgae` here, exactly as in the original post.
    translator = Translator(retry_messgae=True)
    t = translator.translate(text, dest=lang)
    translation = t.text
    print(text, translation)


# trans_single()
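
Since a single call can fail or come back empty, a small wrapper can make the quick test more forgiving. The following is only a sketch: `translate_with_retry` and its parameters are hypothetical names, not part of pygoogletranslation, and it accepts any callable so that it does not depend on the library's internals.

```python
import time
from typing import Callable, Optional


def translate_with_retry(
    translate_fn: Callable[[str, str], str],
    text: str,
    dest: str,
    attempts: int = 3,
    delay: float = 1.0,
) -> Optional[str]:
    """Call translate_fn(text, dest); return None if every attempt fails or is empty."""
    for attempt in range(attempts):
        try:
            result = translate_fn(text, dest)
        except Exception:
            result = ""  # treat any failure as "no translation yet"
        if result and result.strip():
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before the next attempt
    return None
```

With the translator from the test above, `translate_fn` could be something like `lambda text, dest: translator.translate(text, dest=dest).text`.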

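C Problems found during testing

1) When translating into French, two results can come back, which makes the call fail

Example:
Normally, e.g. when translating the English word "good" into Chinese:
[Screenshot: Google Translate result for "good" into Chinese]

But the same input translated into French:
[Screenshot: Google Translate result for "good" into French]
Some French words take different forms depending on grammatical gender, so Google returns more than one translation. As a result, some short sentences raise an error when translated through the code, because the code does not handle this case.

2) Translations are incomplete: only the first sentence is translated

When translating long input I found that, no matter how long the text is, the original source only returns the translation of the first sentence. To address these two problems, my main changes are in pygoogletranslation/utils.py (excerpt):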

def format_translation(translated, dest):
    text = ''
    pron = ''
    for _translated in translated:
        try:
            text += _translated[0][2][1][0][0][5][0][0]
            try:
                if dest == "fr":
                    # French may return both a masculine and a feminine form: keep the first
                    # entry if it is tagged '(masculine)', otherwise use the second entry.
                    tran_list = (_translated[0][2][1][0][0][0]
                                 if _translated[0][2][1][0][0][2] == '(masculine)'
                                 else _translated[0][2][1][0][1][0])
                else:
                    tran_list = _translated[0][2][1][0][0][5]
                for tmp_tran in tran_list:  # iterate over the translations of all sentences
                    text += tmp_tran[0]
            except Exception:
                # fall back to a shallower path when the expected nesting is missing
                tran_list = _translated[0][2][1][0]
                for tmp_tran in tran_list:
                    text += tmp_tran[0]
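
Deeply nested index chains like the ones above fail with an exception as soon as one level is missing. As an illustration only, a small helper can walk such a path and return a default instead; `dig` and `join_sentences` are hypothetical names, and nothing here assumes more about Google's response layout than "nested lists whose segments start with a string".

```python
from typing import Any, Sequence


def dig(data: Any, path: Sequence[int], default: Any = None) -> Any:
    """Follow a chain of indices; return default if any step is missing."""
    current = data
    for index in path:
        try:
            current = current[index]
        except (IndexError, TypeError, KeyError):
            return default
    return current


def join_sentences(segments: Any) -> str:
    """Concatenate the first element of each segment, skipping malformed ones."""
    parts = []
    for segment in segments or []:
        piece = dig(segment, [0])
        if isinstance(piece, str):
            parts.append(piece)
    return "".join(parts)
```

For example, `join_sentences(dig(_translated, [0, 2, 1, 0, 0, 5], default=[]))` would mirror the non-French branch above without relying on a bare `except`.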

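3) Some punctuation marks that interfere with translation are not handled

The original source already handles a few Chinese punctuation marks that interfere with translation. Based on my own tests I added several more; the change is in pygoogletranslation/translate.py: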

            text = text.replace('"', '')
            text = text.replace("'", "")
            text = text.replace("“", "")
            text = text.replace("”", "")
            text = text.replace("‘", "")  # added
            text = text.replace("’", "")  # added
            text = text.replace("\\", "")  # added
            text = text.replace("?", "")  # added
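
The same clean-up can be written as a single pass with `str.translate`. This is just an alternative sketch, not the code in either repository; `STRIP_CHARS` and `strip_problem_chars` are names made up for the example, and the character set is copied from the `replace()` calls above.

```python
# Characters removed before translating (taken from the replace() calls above).
STRIP_CHARS = "\"'“”‘’\\?"

# With a third argument, str.maketrans maps each listed character to None (deletion).
STRIP_TABLE = str.maketrans("", "", STRIP_CHARS)


def strip_problem_chars(text: str) -> str:
    """Remove every character in STRIP_CHARS in a single pass."""
    return text.translate(STRIP_TABLE)
```

Keeping the characters in one constant makes it easier to add another symbol later when a new failure case turns up.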

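All of the changes above are available at https://github.com/lu-tomato/py-googletranslation, which you can download and use directly if needed.

D Finally, translating my whole dataset

Finally, I used it in batch mode to translate my dataset: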

from pygoogletranslation import Translator
import jsonlines
import json
import time


def trans(dataset, path, lang):
    # The keyword is spelled `retry_messgae` here, exactly as in the original post.
    translator = Translator(retry_messgae=True)
    row_path = "datapath/"
    out_path = row_path + dataset + path + lang + "_trans.jsonl"
    count = 0  # number of successfully translated records

    with open(row_path + dataset + path + ".jsonl", "r", encoding="utf-8") as f:
        lines = jsonlines.Reader(f)
        write_in_list = []
        for line in lines:
            item_id = line["id"]
            text = line["text"]
            # text=line["text"].strip().replace("[","").replace(']','').replace("'",'').replace("‘","").replace("’","").replace("»","").replace("?","").replace('\\',"").replace("(","( ").replace(")"," )").replace("`","").replace('"','').replace("!","")  # handles translations that fail because of special symbols
            try:  # try/except handles special cases that cannot be translated
                t = translator.translate(text, dest=lang)
                translation = t.text
            except Exception:
                print(item_id, path, text)  # log the failed record and move on
                continue
            if translation.strip() == "":
                print(item_id, path, text)  # log records that came back empty
                continue

            write_in_list.append({"id": item_id, "text": text, "translation": translation})
            count += 1

            if count % 200 == 0:  # write to the file every 200 records
                with open(out_path, "a", encoding="utf-8") as wf:
                    for write_in in write_in_list:
                        wf.write(json.dumps(write_in, ensure_ascii=False) + "\n")
                write_in_list = []
                time.sleep(3)
        if write_in_list:  # write whatever is left at the end
            with open(out_path, "a", encoding="utf-8") as wf:
                for write_in in write_in_list:
                    wf.write(json.dumps(write_in, ensure_ascii=False) + "\n")
            write_in_list = []
            time.sleep(3)
    print(count)
    time.sleep(5)


for lang in ["de", "ja", "fr", "zh-cn"]:
    for path in ["dev", "test", "train"]:
        print("load dataset", path, lang)
        trans("dataset/", path, lang)  # translate the dev, test and train files one by one
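
Because the output file is opened in append mode, a run that is interrupted and restarted would translate and write the same records again. One possible way to resume, sketched under the assumption that every output line is a JSON object with an `id` field as in the script above (`load_done_ids` is a hypothetical helper):

```python
import json
import os
from typing import Set


def load_done_ids(out_path: str) -> Set[str]:
    """Collect the ids already present in an output JSONL file (empty set if missing)."""
    done: Set[str] = set()
    if not os.path.exists(out_path):
        return done
    with open(out_path, "r", encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            try:
                done.add(str(json.loads(raw)["id"]))
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # skip a half-written or malformed line
    return done
```

Inside the translation loop, a check such as `if str(line["id"]) in done: continue` would then skip records that were already written.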

Source: https://blog.csdn.net/xiyou__/article/details/122171005
