Python Web Scraping and Simulated Login
-Enchant · 2010-10-29 · via 博客园 - -Enchant

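Scraping a web page with Python is very simple; a few lines of code are enough. Here are some brief notes.

The main module you need is urllib2. urllib can be imported as well (the code below uses it for urlencode). See the code for details.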

#!/usr/bin/env python
#-------------------------------------------------------------------------------
# Name:        Simulated web login
# Purpose:
#
# Author:      huwei
#
# Created:     26/10/2010
# Copyright:   (c) huwei 2010
# Licence:     <your licence>
#-------------------------------------------------------------------------------

import urllib
import urllib2


def main():
    # Log in to cnblogs
    loginCNblogs()


# Log in to cnblogs
def loginCNblogs():
    try:
        # Set up cookie handling so the session cookie is kept between requests
        cookies = urllib2.HTTPCookieProcessor()
        opener = urllib2.build_opener(cookies)
        urllib2.install_opener(opener)

        # Form fields of the login page; replace the username and password placeholders
        parms = {"tbUserName": "username", "tbPassword": "password",
                 "__EVENTTARGET": "btnLogin", "__EVENTARGUMENT": "",
                 "__VIEWSTATE": "/wEPDwULLTExMDE0MzIzNDRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQtjaGtSZW1lbWJlcmcJekJlt5rFwfnjeMMnX9V58Xhg",
                 "__EVENTVALIDATION": "/wEWBQKit6iCDALyj/OQAgK3jsrkBALR55GJDgKC3IeGDK6TQlRlirS2Zja1Lmeh02u4XMwV",
                 "txtReturnUrl": "http://bboy.cnblogs.com"}

        # Passing data as the second argument sends the request as a POST
        loginUrl = "http://passport.cnblogs.com/login.aspx"
        login = urllib2.urlopen(loginUrl, urllib.urlencode(parms))
        #print(unicode(login.read(),"utf8"))

        # Fetch the settings page, which requires a logged-in session
        avatar = urllib2.urlopen("http://home.cnblogs.com/set/avatar/")
        #print(avatar.read().decode("utf8"))
    except Exception as e:
        print(e)


if __name__ == '__main__':
    main()
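
The script above hard-codes the __VIEWSTATE and __EVENTVALIDATION values copied from the login page, and those values may change whenever the page does. As a rough sketch, not the author's method, the hidden fields could instead be read from the login page's HTML before posting. The example below assumes Python 3 (where urllib2 became urllib.request and urlencode moved to urllib.parse); HiddenFieldParser and build_login_request are hypothetical helper names, and the form field names are taken from the script above.

```python
from html.parser import HTMLParser
import urllib.parse
import urllib.request


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_request(login_url, login_html, username, password):
    """Build a POST request from the page's hidden fields plus credentials."""
    parser = HiddenFieldParser()
    parser.feed(login_html)
    form = dict(parser.fields)
    # Field names below come from the original script; they are assumptions
    # about the login form, not a documented API.
    form.update({
        "tbUserName": username,
        "tbPassword": password,
        "__EVENTTARGET": "btnLogin",
    })
    body = urllib.parse.urlencode(form).encode("utf-8")
    # Supplying a body makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=body)
```

The returned request object would then be passed to the cookie-enabled opener's open() method, after first fetching the login page with the same opener to obtain login_html.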

Fetching a page is simple: urllib2.urlopen(url).read() returns the page source.
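
For readers on Python 3, where urllib2 no longer exists and its functions live in urllib.request, the same one-liner might look like the sketch below. fetch_source is a hypothetical helper name, and the UTF-8 default and 10-second timeout are assumptions rather than anything the original post specifies.

```python
import urllib.request


def fetch_source(url, encoding="utf-8"):
    """Return the body at `url` decoded as text (the encoding is assumed)."""
    # The context manager closes the connection once the body has been read.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode(encoding)
```

Calling fetch_source("http://example.com/") would return that page's HTML as a string. Pages served in another encoding need the matching encoding argument.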

Here we are fetching a page that is only available after logging in, so cookie handling has to be set up first:

cookies = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookies)
urllib2.install_opener(opener)

Once the cookie handler is installed, every later call to urllib2.urlopen() sends the cookies from your successful login.
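
To see the effect of the cookie handler without touching a real site, here is a self-contained sketch in Python 3 (urllib.request in place of urllib2). Everything about the server is hypothetical: DemoHandler, the /login and /home paths, the form fields, and the session value stand in for a real login site. It uses opener.open() directly rather than install_opener(), which keeps the cookie jar local to one opener.

```python
import http.cookiejar
import http.server
import threading
import urllib.parse
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical login site: POST /login sets a cookie, GET /home checks it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        self.send_response(200)
        if form.get("user") == ["demo"]:
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "session=abc123" in self.headers.get("Cookie", "")
        body = b"welcome" if logged_in else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_opener(*handlers):
    # ProxyHandler({}) ignores proxy settings so the loopback demo stays local.
    return urllib.request.build_opener(urllib.request.ProxyHandler({}), *handlers)


def fetch_after_login(base_url):
    """Log in, then fetch /home with the same cookie-aware opener."""
    jar = http.cookiejar.CookieJar()
    opener = make_opener(urllib.request.HTTPCookieProcessor(jar))
    data = urllib.parse.urlencode({"user": "demo", "password": "secret"})
    opener.open(base_url + "/login", data.encode("ascii")).read()
    return opener.open(base_url + "/home").read().decode("utf-8")


def run_demo():
    """Return the /home body fetched with and without cookie handling."""
    server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base_url = "http://127.0.0.1:%d" % server.server_port
        with_cookies = fetch_after_login(base_url)
        plain = make_opener().open(base_url + "/home")
        return with_cookies, plain.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

run_demo() returns a pair of page bodies: the first is fetched through the opener that logged in, the second through an opener with no cookie handling, so the two can be compared directly.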