Using Playwright in Python
慕尘 · 2025-03-14 · via 博客园 - 慕尘

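Playwright is an open-source web automation framework developed by Microsoft, used mainly for automated testing and for driving browsers programmatically.

It is a cross-browser tool with bindings for Python, JavaScript, and several other languages.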

Installation

pip install playwright

Install the browsers that Playwright supports

playwright install

Extracting the text, title, summary, and keywords from HTML

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup  # parses the HTML; install with: pip install beautifulsoup4

def extract_page_content(url):
    with sync_playwright() as p:
        # Launch the browser
        browser = p.chromium.launch(headless=True)  # set headless=False to watch the browser while debugging
        page = browser.new_page()

        # Navigate to the target page
        page.goto(url)
        page.wait_for_load_state("networkidle")  # wait until network activity has settled

        # Get the page's HTML content
        html_content = page.content()

        # Close the browser
        browser.close()

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # Extract the title
    title = soup.find("title").text if soup.find("title") else "No title found"

    # Extract the summary (meta description); .get() avoids a KeyError if the tag has no content attribute
    meta_description = soup.find("meta", attrs={"name": "description"})
    description = meta_description.get("content", "No description found") if meta_description else "No description found"

    # Extract the keywords (meta keywords)
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    keywords = meta_keywords.get("content", "No keywords found") if meta_keywords else "No keywords found"

    # Extract the body text (HTML tags removed)
    text = soup.get_text(separator="\n", strip=True)

    return {
        "title": title,
        "description": description,
        "keywords": keywords,
        "text": text
    }

url = "https://www.cnblogs.com/baby123/p/18772196"
result = extract_page_content(url)
print("Title:", result["title"])
print("Description:", result["description"])
print("Keywords:", result["keywords"])
print("Content:", result["text"])
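
One possible refinement is to separate fetching from parsing, so the parsing step can be tried on a plain HTML string without launching a browser. The sketch below is only an illustration under assumptions: the helper names fetch_html, meta_content, and parse_html, the timeout_ms parameter, and the sample HTML are all hypothetical and do not come from the original post.

```python
from bs4 import BeautifulSoup


def fetch_html(url, timeout_ms=30_000):
    """Return the rendered HTML of url (hypothetical helper)."""
    # Imported here so parse_html() below can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            page.wait_for_load_state("networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()  # runs even if navigation raises


def meta_content(soup, name, default):
    """Return the content attribute of <meta name=...>, or default if absent."""
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content", default) if tag else default


def parse_html(html):
    """Extract title, description, keywords, and visible text from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else "No title found",
        "description": meta_content(soup, "description", "No description found"),
        "keywords": meta_content(soup, "keywords", "No keywords found"),
        "text": soup.get_text(separator="\n", strip=True),
    }


# Made-up sample page, used only to exercise the parser offline.
sample = """<html><head><title>Demo</title>
<meta name="description" content="A short page">
</head><body><p>Hello</p></body></html>"""
print(parse_html(sample))
```

For a live page, the two steps would be chained as parse_html(fetch_html(url)). The try/finally block is there so the browser is closed even when navigation times out.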