Batch-translating text with a Python crawler
Jans · 2023-03-15 · via 博客园 - Jans

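Many translation sites offer an API, so you can write your own program to automate translation work. Free APIs are becoming rarer, though, and come with many restrictions.
Sometimes I am simply too lazy to copy and paste over and over, so I wanted to use selenium, the Python browser-automation package, to drive the web page and automate the translation.
I looked through similar open-source tools online and found that most of them no longer work: the translation sites have deployed countermeasures, and their pages use all kinds of dynamic JS to block programmatic scraping.
The powerful selenium can still handle it, though. Here is the code: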

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import StaleElementReferenceException
from time import sleep
from selenium.webdriver.chrome.options import Options
import os
import re
import sys
import glob
from loguru import logger
import random
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--target", type=str, choices=["french", "english", "german", "spanish", "portuguese", "italian", "dutch", "polish", "russian"],
                help="""Choose target language for translation {"french", "english", "german", "spanish", "portuguese", "italian", "dutch", "polish", "russian"}""")
args = parser.parse_args()

logger.add(sys.stderr, format="{time} {message}", filter="selenium_translate", level="INFO")

TEXTS_FOLDER_PATH = './texts_folder/'
TRANSLATED_TEXTS_PATH = './texts_folder/translated_texts/'

if os.path.exists(TEXTS_FOLDER_PATH):
    text_paths = glob.glob(TEXTS_FOLDER_PATH + "*.txt")
    if not os.path.exists(TRANSLATED_TEXTS_PATH):
        logger.info(f"Creating new folder: '{TRANSLATED_TEXTS_PATH}'")
        os.makedirs(TRANSLATED_TEXTS_PATH)
else:
    raise AssertionError("Could not find './texts_folder/', folder in which you should add the texts files you want to translate")

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

WEBDRIVER_PATH = './webdrivers/chromedriver'
DEEPL_URL = 'https://www.deepl.com/translator#en/pt/'

# Selenium 4 takes the chromedriver path through a Service object
# instead of a positional argument.
HEADLESS = True
if HEADLESS:
    driver = webdriver.Chrome(service=Service(WEBDRIVER_PATH), options=options)
    logger.info("Running headless...")
else:
    driver = webdriver.Chrome(service=Service(WEBDRIVER_PATH))

TARGET_LANGUAGES = ({"french":1,
                    "english":2,
                    "german":3,
                    "spanish":4,
                    "portuguese":5,
                    "italian":6,
                    "dutch":7,
                    "polish":8,
                    "russian":9})

def translate(text_to_translate, target_language=TARGET_LANGUAGES["portuguese"]):

    driver.get(DEEPL_URL)
    input_text_xpath = """//*[@id="panelTranslateText"]/div[1]/div[2]/section[1]/div[3]/div[2]/d-textarea/div"""
    output_text_xpath = """//*[@id="panelTranslateText"]/div[1]/div[2]/section[2]/div[3]/div[1]/d-textarea/div/p[1]"""
    # The two language-selector XPaths are defined but not used below;
    # the language pair comes from DEEPL_URL.
    language_select_xpath = """//*[@id="dl_translator"]/div[1]/div[3]/div[1]/div/button/div"""
    target_language_xpath = f"""//*[@id="dl_translator"]/div[1]/div[3]/div[1]/div/div/button[{target_language}]"""

    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed
    input_box = WebDriverWait(driver, 20).until(lambda driver: driver.find_element(By.XPATH, input_text_xpath))
    ActionChains(driver).click(input_box).send_keys(text_to_translate).perform()
    sleep(random.uniform(2,3))
    output_box = WebDriverWait(driver, 45).until(lambda driver: driver.find_element(By.XPATH, output_text_xpath))
    try:
        translated_text = output_box.text
    except StaleElementReferenceException:
        # The page re-rendered the output node; wait, locate it again and retry once
        sleep(random.uniform(3,5))
        output_box = WebDriverWait(driver, 45).until(lambda driver: driver.find_element(By.XPATH, output_text_xpath))
        translated_text = output_box.text

    logger.info("Result: "+translated_text)
    return translated_text



def main(target):
    logger.info("enumerate files...")
    for i, path in enumerate(text_paths):
        logger.info(f"Text file:{path}")
        # Reset per file so one output does not also contain earlier files' text
        translated_text = ""
        with open(path) as f:
            lines = f.readlines()
            for text in lines:
                translated_text += translate(text) +"\r\n"
                sleep(random.uniform(2,3))
                logger.info(f"Progress: Translated -> {text}")
        output_fname = os.path.join(TRANSLATED_TEXTS_PATH, 'TRANSLATED_' + os.path.basename(path))
        logger.info(f"Writing to file '{output_fname}'")
        # The with block flushes and closes the file on exit
        with open(output_fname, 'w') as out:
            out.write(translated_text)

    if not HEADLESS:
        sleep(10)

    # quit() ends the whole browser session, not just the current window
    driver.quit()
    logger.info("Done.")

if __name__ == "__main__":
    if args.target in TARGET_LANGUAGES.keys():
        main(args.target)
    else:
        logger.info("""Could not understand the target language for translation. Target language must be in this set {"french", "english", "german", "spanish", "portuguese", "italian", "dutch", "polish", "russian"}""")
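
The translate() function above retries exactly once when reading the output raises StaleElementReferenceException, after a random pause of 3 to 5 seconds. The same idea can be written as a small reusable helper. This is only a sketch: retry_call and its parameters are hypothetical names that are not part of the original script, and it uses only the standard library so it can be tried without a browser.

```python
import random
import time


def retry_call(func, exceptions, attempts=3, min_delay=3.0, max_delay=5.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Only the exception types given in `exceptions` trigger a retry;
    the last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            # Randomized pause, like sleep(random.uniform(3, 5)) in translate()
            sleep(random.uniform(min_delay, max_delay))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_read():
        # Stand-in for reading output_box.text: fails twice, then succeeds
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("element went stale")
        return "translated text"

    # Pass a no-op sleep so the demo finishes immediately
    print(retry_call(flaky_read, RuntimeError, sleep=lambda seconds: None))
```

In the script this could wrap the read step, for example retry_call(read_output, StaleElementReferenceException), where read_output is a hypothetical function that locates the output element again and returns its .text.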

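Usage:

First, install the two Python packages the script depends on:

   pip3 install selenium
   pip3 install loguru

Download the Chrome WebDriver from https://chromedriver.chromium.org/
and save it as ./webdrivers/chromedriver, the path that WEBDRIVER_PATH in the script points to (or change WEBDRIVER_PATH to match where you put it).
Save the Python code above as trans_tool.py.
Create a subdirectory named texts_folder and put the .txt files to translate in it (the text is translated one line at a time).
On the command line, run python3 trans_tool.py --target portuguese
Here I am translating English into Portuguese. The language pair comes from the #en/pt/ fragment of DEEPL_URL, and the script as written only validates --target without applying it to the page, so editing that fragment is one way to pick another target language.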
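
The script hard-codes the English-to-Portuguese pair in DEEPL_URL ('https://www.deepl.com/translator#en/pt/'). One possible way to make --target take effect is to derive that URL from the chosen language name. The sketch below rests on assumptions: only the "pt" code appears in the original, the other two-letter codes and the names LANGUAGE_CODES and build_deepl_url are hypothetical, and whether the site accepts every pair in this form has not been checked here.

```python
# Hypothetical mapping from the script's --target names to two-letter codes.
# Only "pt" appears in the original DEEPL_URL; the other codes are assumptions.
LANGUAGE_CODES = {
    "french": "fr",
    "english": "en",
    "german": "de",
    "spanish": "es",
    "portuguese": "pt",
    "italian": "it",
    "dutch": "nl",
    "polish": "pl",
    "russian": "ru",
}

DEEPL_BASE = "https://www.deepl.com/translator"


def build_deepl_url(target, source="en"):
    """Return a URL of the form <base>#<source>/<target code>/."""
    try:
        code = LANGUAGE_CODES[target]
    except KeyError:
        raise ValueError(f"Unsupported target language: {target!r}") from None
    return f"{DEEPL_BASE}#{source}/{code}/"


if __name__ == "__main__":
    print(build_deepl_url("portuguese"))
```

With something like this, the script could call driver.get(build_deepl_url(args.target)) instead of loading the fixed DEEPL_URL.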
Hope this helps!