Python Web Crawler Development, Part 4: Sequential Crawler Code Example
2018-10-25 · via 博客园 - 昕

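Features implemented: proxy support, rate limiting, crawl-depth limiting, and handling of anti-crawler measures (robots.txt and User-Agent).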
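The robots.txt check behind the anti-crawler item can be tried without any network access. The following is a minimal sketch, assuming a made-up robots.txt (the rules and the `example.com` URLs are hypothetical, not taken from any real site); it feeds the rules to `urllib.robotparser.RobotFileParser.parse()` instead of fetching them with `read()`.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the file's lines directly, so no HTTP request is made.
rp.parse(ROBOTS_TXT.splitlines())

# BadCrawler is banned from the whole site.
bad_allowed = rp.can_fetch('BadCrawler', 'http://example.com/index.html')
# Any other agent falls under the '*' rules: everything except /trap.
good_allowed = rp.can_fetch('GoodCrawler', 'http://example.com/index.html')
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.com/trap')

print(bad_allowed, good_allowed, trap_allowed)
```

This is why the crawler's `user_agent` argument matters: the same URL can be allowed for one agent name and blocked for another.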

import re
import queue
import urllib.parse
import urllib.robotparser
import time
from urllib import request
from datetime import datetime

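def download(url, user_agent="wsap", num=2):
    """Download a page and return its raw bytes, or None on failure."""
    print("Downloading:" + url)
    try:
        req = request.Request(url)
        # The HTTP header is named 'User-Agent'; 'user_agent' would be sent as an unrelated header.
        req.add_header('User-Agent', user_agent)
        html = request.urlopen(req).read()
    except Exception as e:
        print('Download error:', e)
        html = None
        # Retry only on 5xx server errors, at most `num` more times.
        if num > 0 and hasattr(e, "code") and 500 <= e.code < 600:
            return download(url, user_agent, num - 1)
    return html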

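def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='BadCrawler', proxy=None, num_retries=1):
    """Crawl from seed_url, following same-domain links that match link_regex."""
    # pop() takes from the right end, so the deque acts as a stack (depth-first order).
    crawl_queue = queue.deque([seed_url])
    # Maps each discovered URL to the depth at which it was found.
    seen = {seed_url: 0}
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent
    # Install a global opener so that request.urlopen() inside download()
    # sends the extra headers and, if a proxy is given, routes through it.
    handlers = [request.ProxyHandler({'http': proxy, 'https': proxy})] if proxy else []
    opener = request.build_opener(*handlers)
    opener.addheaders = list(headers.items())
    request.install_opener(opener)
    while crawl_queue:
        url = crawl_queue.pop()
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, user_agent, num_retries)
            links = []
            depth = seen[url]
            # Skip link extraction if the download failed or the depth limit is reached.
            if html is not None and depth != max_depth:
                if link_regex:
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))
                for link in links:
                    link = normalize(seed_url, link)
                    if link not in seen:
                        seen[link] = depth + 1
                        if same_domain(seed_url, link):
                            crawl_queue.append(link)
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print('Blocked by robots.txt:' + url)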

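class Throttle:
    """Rate limiter: enforces a minimum delay between requests to the same domain."""
    def __init__(self, delay):
        self.delay = delay
        # Maps each domain to the time it was last accessed.
        self.domains = {}

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            # total_seconds() keeps fractions of a second; .seconds would truncate them.
            sleep_secs = self.delay - (datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()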

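def normalize(seed_url, link):
    """Strip the #fragment and resolve the link against the seed URL."""
    link, _ = urllib.parse.urldefrag(link)
    return urllib.parse.urljoin(seed_url, link)

def same_domain(url1, url2):
    """Return True if both URLs have the same host (netloc)."""
    return urllib.parse.urlparse(url1).netloc == urllib.parse.urlparse(url2).netloc

def get_robots(url):
    """Fetch and parse the robots.txt of the site that url belongs to."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urllib.parse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp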

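def get_links(html):
    """Return the href values of all <a> tags found in the downloaded bytes."""
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # Replace undecodable bytes instead of raising on pages that are not valid UTF-8.
    html = html.decode('utf-8', errors='replace')
    return webpage_regex.findall(html)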

if __name__ == '__main__':
    link_crawler('http://example.webscraping.com', '/places/default/view/', delay=0, num_retries=1, max_depth=1, user_agent='GoodCrawler')
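The call above passes `max_depth=1`, so only the seed page and the pages it links to directly are visited. To see how the `seen` dictionary and the depth check interact without touching the network, here is a rough sketch that replays the same bookkeeping on an in-memory link graph. `FAKE_SITE` and `crawl_order` are hypothetical names invented for this illustration; the dictionary stands in for `download` plus `get_links`.

```python
# Hypothetical site: page -> links found on that page.
FAKE_SITE = {
    '/': ['/a', '/b'],
    '/a': ['/a1', '/'],
    '/b': ['/b1'],
    '/a1': ['/a2'],
    '/b1': [],
    '/a2': [],
}

def crawl_order(seed, max_depth=-1):
    """Return pages in visit order, using the same seen/depth logic as link_crawler."""
    stack = [seed]       # pop() from the end gives depth-first order
    seen = {seed: 0}     # page -> depth at which it was discovered
    visited = []
    while stack:
        url = stack.pop()
        visited.append(url)
        depth = seen[url]
        # Pages at max_depth are visited but their links are not followed.
        # With the default of -1 the test never matches, so depth is unlimited.
        if depth != max_depth:
            for link in FAKE_SITE.get(url, []):
                if link not in seen:
                    seen[link] = depth + 1
                    stack.append(link)
    return visited

print(crawl_order('/', max_depth=1))  # ['/', '/b', '/a']
print(crawl_order('/'))               # ['/', '/b', '/b1', '/a', '/a1', '/a2']
```

The back-link from `/a` to `/` is ignored because `/` is already in `seen`, which is also what keeps the real crawler from looping on pages that link to each other.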