
Crawling Web Pages

June 4, 2020 - My Mood
Run the spider in the background, logging discovered URLs to a file:

nohup scrapy runspider spider.py > urls.log &
import scrapy

class MySpider(scrapy.Spider):
    name = 'domain'
    allowed_domains = ['domain']
    start_urls = [
        'https://domain/'
    ]

    # Track visited links so the same URL is not crawled twice
    tempList = []

    def parse(self, response):
        # for h3 in response.xpath('//h3').getall():
        #     yield {"title": h3}

        # Follow every link on the page that has not been seen yet
        for link in response.xpath('//a/@href').getall():
            if link not in self.tempList:
                print(link)
                self.tempList.append(link)
                # Resolve relative hrefs against the current page URL
                yield scrapy.Request(response.urljoin(link), self.parse)
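The deduplication logic above can be sketched without Scrapy, using a plain set instead of a list for O(1) membership checks and `urllib.parse.urljoin` in place of `response.urljoin`. This is a minimal illustration, not part of the original spider; `example.com` is a placeholder standing in for the 'domain' placeholder above.

```python
from urllib.parse import urljoin

# Set of absolute URLs already queued; a set gives O(1) lookups,
# unlike the O(n) scan over tempList in the spider above.
seen = set()

def enqueue(base, links):
    """Resolve each href against the page URL and return only the
    absolute URLs not seen before, marking them as seen."""
    new = []
    for link in links:
        url = urljoin(base, link)  # resolve relative hrefs
        if url not in seen:
            seen.add(url)
            new.append(url)
    return new

# "b" appears twice but is only queued once
print(enqueue("https://example.com/a/", ["b", "/c", "b"]))
# → ['https://example.com/a/b', 'https://example.com/c']
```

Using a set also avoids unbounded slowdown as the crawl grows: with a list, every new link is compared against all previously seen links.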

