Learning Chinese Characters for Primary School Students: Scraping Character Data
BigRain · 2022-04-24 · via 博客园 - BigRain

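The script looks up each Chinese character of a poem on Baidu Hanyu (hanyu.baidu.com) and extracts its radical, pinyin, related words, and stroke-order GIF. Most of the work is done by regular expressions.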
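To illustrate the approach in isolation, here is a minimal sketch of the pattern the script relies on: locate a marker with `str.find`, slice up to the closing tag, then run a small regular expression on that slice only. The HTML fragment is a hypothetical sample shaped like the markers the script searches for, and the helpers `slice_between` and `first_group` are names introduced here for illustration; the real page may look different.

```python
import re

# Hypothetical fragment modelled on the markers used in the script;
# the actual Baidu Hanyu markup may differ.
SAMPLE_HTML = (
    '<ul>'
    '<li id="radical"><label>部首</label><span>氵</span></li>'
    '<li id="tone_py"><b>quán</b></li>'
    '</ul>'
)


def slice_between(html: str, start_marker: str, end_marker: str) -> str:
    """Return the text from start_marker up to the next end_marker, or ''."""
    start = html.find(start_marker)
    if start == -1:
        return ""
    end = html.find(end_marker, start)
    return html[start:end] if end != -1 else html[start:]


def first_group(pattern: str, text: str) -> str:
    """Return the first capture group of the first match, or ''."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


radical = first_group(
    r"<span>(.)</span>",
    slice_between(SAMPLE_HTML, '<li id="radical">', "</li>"),
)
pinyin = first_group(
    r"<b>(.*?)</b>",
    slice_between(SAMPLE_HTML, '<li id="tone_py">', "</li>"),
)
print("radical:", radical)
print("pinyin:", pinyin)
```

Restricting the regex to a small slice keeps the patterns short and avoids accidental matches elsewhere on the page. Returning an empty string on a missing marker is a choice made for this sketch; the script itself uses `assert` and stops instead.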


import json
import os
import re
import time

import requests

words = "泉眼无声惜细流,树荫照水弄轻柔。小荷才露尖尖角,早有蜻蜓立上头。"
words_info = {}

# url_format = 'https://hanyu.baidu.com/s?wd={0}&from=zici'


def getWords(words):
    url_format = 'https://hanyu.baidu.com/s?wd={0}&ptype=zici'
    # Make sure the output directory for the stroke-order GIFs exists
    os.makedirs("gif", exist_ok=True)
    for char in words:
        # Skip anything that is not a Chinese character
        if not ('\u4e00' <= char <= '\u9fff'):
            continue
        if char in words_info:
            print("\t already collected, skipping")
            continue
        res = requests.get(url_format.format(char))
        html = res.text
        print(f"@{time.time()}")
        print(f"Collecting and parsing character: [{char}]")
        info = char_info(char, html)
        info["name"] = char
        words_info[char] = info
        gif_url = info["tupian"]
        # Only download when a GIF URL was actually found
        if gif_url:
            res = requests.get(gif_url)
            fs_name = os.path.join("gif", "{0}.gif".format(char))
            # fs_name="aa.gif"
            with open(fs_name, "wb") as fs:
                fs.write(res.content)
        print(f"{char} written successfully, collection finished")
        # Avoid collecting too fast and getting blocked
        # time.sleep(2)
    return words_info


def char_info(name: str, html: str):
    info = {}
    info["name"] = name

    # Radical
    position = html.find('<li id="radical">')
    assert position > 0
    end = html.find("</li>", position)
    match = re.search(r"<span>(.)</span>", html[position:end])
    info["bushou"] = match.group(1) if match else ""
    print("Radical:", info["bushou"])

    # Pinyin
    position = html.find('<li id="tone_py">')
    assert position > 0
    end = html.find("</li>", position)
    match = re.search(r"<b>(.*)</b>", html[position:end])
    info["pinyin"] = match.group(1) if match else ""
    print("Pinyin:", info["pinyin"])

    # Related words (the marker must stay in Chinese to match the page)
    position = html.find(
        '<h1><b class="title" id="related_term">相关组词</b></h1>')
    assert position > 0
    end = html.find("</div>", position)
    match = re.findall(r"<a [^>]*>([^<]*)</a>", html[position:end],
                       re.IGNORECASE | re.MULTILINE)
    info["zhuci"] = match
    print("Related words:", info["zhuci"])

    # Stroke-order GIF
    position = html.find('<img id="word_bishun" class="bishun"')
    assert position > 0
    end = html.find("</div>", position)
    match = re.search(
        # r'data-gif="([^"]*)"'  match.group(1)
        # r"""data-gif=["']([^"]*)["']"""  match.group(1)
        # Dots are escaped so they only match a literal "."
        r'https://hanyu-word-gif\.cdn\.bcebos\.com/\w{30,}\.gif',
        html[position:end], re.IGNORECASE | re.MULTILINE)
    info["tupian"] = match.group(0) if match else ""
    print("Stroke order:", info["tupian"])
    return info


if __name__ == '__main__':
    # Load earlier results so already collected characters are skipped
    if os.path.exists('words.json'):
        with open('words.json', 'r', encoding="utf-8") as fs:
            json_text = fs.read()
        if json_text.startswith("{"):
            words_info.update(json.loads(json_text))
    getWords(words)
    with open('words.json', 'w', encoding="utf-8") as fs:
        json_text = json.dumps(words_info, ensure_ascii=False)
        fs.write(json_text)
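The script leaves `time.sleep(2)` commented out, with a note that collecting too fast may get the client blocked. As a rough sketch of how that pause could be combined with the "skip what is already collected" check, the loop below waits only between real requests and never for cached characters. The function `collect`, its parameters, and the stub fetcher are hypothetical names introduced for this example; they are not part of the original script.

```python
import time
from typing import Callable, Dict


def is_chinese_char(char: str) -> bool:
    """Same range check as the script: basic CJK Unified Ideographs."""
    return "\u4e00" <= char <= "\u9fff"


def collect(
    text: str,
    fetch: Callable[[str], dict],
    cache: Dict[str, dict],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, dict]:
    """Fetch info for each new Chinese character, pausing between requests."""
    first_request = True
    for char in text:
        # Punctuation and already collected characters cost no request
        if not is_chinese_char(char) or char in cache:
            continue
        if not first_request:
            sleep(delay_seconds)  # throttle only between real requests
        first_request = False
        cache[char] = fetch(char)
    return cache


# Demo with a stub fetcher so nothing touches the network.
def stub_fetch(char: str) -> dict:
    return {"name": char}


demo_cache = {"荷": {"name": "荷"}}
collect("小荷才露尖尖角", stub_fetch, demo_cache, delay_seconds=0.0)
print(sorted(demo_cache))
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without network access. In the real script, `fetch` would wrap the `requests.get` call and `char_info`.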