trafilatura
慕尘 · 2025-03-19 · via 博客园 - 慕尘

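trafilatura is a Python library designed to extract the core content of web pages.

It is particularly suited to applications that need the main text of an HTML page, such as the article body and title, while leaving out navigation bars, ads, sidebars, and other non-essential content.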

Installation

pip install trafilatura

Example

import trafilatura

# URL of the page to process
url = "https://www.cnblogs.com/baby123/p/18755330"
# Download the page (fetch_url returns None if the download fails)
downloaded = trafilatura.fetch_url(url)
# Extract the main text content
result = trafilatura.extract(downloaded)
print(result)
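
Both calls in the example can come back empty: `fetch_url` returns `None` when the download fails, and `extract` returns `None` when no main text is found. A minimal sketch of a wrapper that handles both cases follows. The function name `extract_main_text` is hypothetical, and the `include_comments` and `include_tables` switches are optional `extract` arguments used here only to illustrate narrowing the output.

```python
from typing import Optional


def extract_main_text(url: str) -> Optional[str]:
    """Return the main text of `url`, or None if download or extraction fails.

    `extract_main_text` is an illustrative helper, not part of trafilatura.
    """
    # Imported inside the function so this module still loads
    # on machines where trafilatura is not installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        # The download failed (network error, bad status, and so on).
        return None

    # Leave out reader comment sections and tables; extract() itself
    # returns None when it cannot find any main content.
    return trafilatura.extract(
        downloaded,
        include_comments=False,
        include_tables=False,
    )
```

A caller can then test the result for `None` before printing or storing it.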

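For sites that load their content dynamically, you may first need a tool such as Playwright or Selenium to obtain the fully rendered HTML, and then pass that HTML to Trafilatura for extraction.

This approach is slower, however.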
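
Because the browser step is the slow part, one option is to try the plain download first and render the page only when that yields too little text. The sketch below is an assumption about how this could be arranged, not something the library provides: `extract_with_fallback`, its parameters, and the `min_chars` threshold of 200 are all hypothetical, and the fetchers are passed in as plain callables.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]    # url -> HTML, or None on failure
Extractor = Callable[[str], Optional[str]]  # HTML -> main text, or None


def extract_with_fallback(
    url: str,
    fetch_static: Fetcher,
    fetch_rendered: Fetcher,
    extract: Extractor,
    min_chars: int = 200,  # hypothetical threshold; tune per site
) -> Optional[str]:
    """Try the cheap static download first; render in a browser only if needed."""
    html = fetch_static(url)
    text = extract(html) if html else None
    if text and len(text) >= min_chars:
        # The static HTML already contained enough main text.
        return text

    # Static HTML was missing or too thin, so pay for a browser render.
    rendered = fetch_rendered(url)
    rendered_text = extract(rendered) if rendered else None

    # Prefer the rendered result, but keep the static text if rendering failed.
    return rendered_text or text
```

With trafilatura, `fetch_static` could be `trafilatura.fetch_url` and `extract` could be `trafilatura.extract`, while `fetch_rendered` would wrap a browser-based fetch such as a Playwright one.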

import asyncio
from playwright.async_api import async_playwright
import trafilatura
import time

async def fetch_dynamic_content(browser, url):
    page = await browser.new_page()
    try:
        await page.goto(url)
        # Wait for the 'networkidle' state so dynamically loaded content is present
        await page.wait_for_load_state('networkidle')
        html_content = await page.content()
        return html_content
    finally:
        await page.close()

def extract_core_content(html_content):
    # Extract the core content with Trafilatura
    result = trafilatura.extract(html_content)
    return result

async def main():
    start_count = time.perf_counter()
    
    urls = ["http://jinan.tianqi.com/"]  # Add more URLs here to test concurrent fetching
    
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        tasks = [fetch_dynamic_content(browser, url) for url in urls]
        
        results = await asyncio.gather(*tasks)
        
        core_contents = []
        for html_content in results:
            core_content = extract_core_content(html_content)
            core_contents.append(core_content)
            print(core_content)
        
        await browser.close()
    
    end_count = time.perf_counter()
    elapsed_time = round(end_count - start_count, 2)
    print(f"Elapsed time: {elapsed_time} s")

# Run the main function
if __name__ == "__main__":
    asyncio.run(main())
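
When the URL list grows, `asyncio.gather` opens every page at once, and a single failing page raises out of the whole batch. A minimal sketch of one way to cap the number of pages open at a time and to turn failures into `None` is shown below. `fetch_all_bounded`, `fetch_one`, and the default limit of 3 are hypothetical names and values chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def fetch_all_bounded(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,  # hypothetical default; tune for your machine
) -> List[Optional[str]]:
    """Fetch all URLs with at most `max_concurrency` fetches in flight.

    Results come back in the same order as `urls`; a failed fetch yields None.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> Optional[str]:
        async with semaphore:  # wait here until a slot is free
            try:
                return await fetch_one(url)
            except Exception:
                # One bad page should not cancel the rest of the batch.
                return None

    return await asyncio.gather(*(guarded(url) for url in urls))
```

In the script above, `fetch_one` could be `lambda url: fetch_dynamic_content(browser, url)`, with `None` entries skipped before calling `extract_core_content`.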