未加星标

Write your first web crawler in Python Scrapy

字体大小 | |
[开发(python) 所属分类 开发(python) | 发布者 店小二04 | 时间 2017 | 作者 红领巾 ] 0人收藏点击收藏

Thescraping series will not get completed without discussing Scrapy . In this post I am going to write a web crawler that will scrape data from OLX’s Electronics & Appliances’ items. Before I get into the code, how about having a brief intro of Scrapy itself?

What is Scrapy?

From Wikipedia :

Scrapy (/skrepi/ skray-pee)[1] is a free and open source web crawling framework , written in python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.

A web crawling framework which has done all the heavy lifting that is needed to write a crawler. What are those things, I will explore further below.

Read on!

Creating Project

Scrapy introduces the idea of a project with multiple crawlers or spiders in a single project. This concept is helpful specially if you are writing multiple crawlers of different sections of a site or sub-domains of a site. So, first create the project:

Adnans-MBP:ScrapyCrawlersAdnanAhmad$ scrapystartprojectolx New Scrapyproject 'olx', usingtemplatedirectory '//anaconda/lib/python2.7/site-packages/scrapy/templates/project', createdin: /Development/PetProjects/ScrapyCrawlers/olx Youcanstartyourfirstspiderwith: cd olx scrapygenspiderexampleexample.com CreatingCrawler

I ran the command scrapy startproject olx which will create a project with name olx and helpful information for next steps. You go to the newly created folder and then execute command for generating first spider with name and the domain of the site to be crawled:

Adnans-MBP:ScrapyCrawlersAdnanAhmad$ cd olx/ Adnans-MBP:olxAdnanAhmad$ scrapygenspiderelectronicswww.olx.com.pk Createdspider 'electronics' usingtemplate 'basic' in module: olx.spiders.electronics

I generated the code of my first Spider with name electronics , since I am accessing the electronics section of OLX I named it like that, you can name it to anything you want or dedicate your first spider to your spouse or (girl|boy)friend

The final project structure will be something like given below:


Write your first web crawler in Python Scrapy

As you can see, there is a separate folder only for Spiders, as mentioned, you can add multiple spiders with in a single project. Let’s open electronics.py spider file. When you open it, you will find something like that:

# -*- coding: utf-8 -*- import scrapy class ElectronicsSpider(scrapy.Spider): name = "electronics" allowed_domains = ["www.olx.com.pk"] start_urls = ['http://www.olx.com.pk/'] def parse(self, response): pass

As you can see, ElectronicsSpider is subclass of scrapy.Spider . The name property is actually name of the spider which was given in the spider generation command. This name will help while running the crawler itself. The allowed_domains property tells which domains are accessible for this crawler and strart_urls is the place to mention initial URLs that to be accessed at first place. Beside file structure this is a good feature to draw the boundaries of your crawler.

The parse method, as the name suggests that to parse the content of the page being accessed. Since I am going to write a crawler that goes to multiple pages, I am going to make a few changes.

from scrapy.spidersimport CrawlSpider, Rule from scrapy.linkextractorsimport LinkExtractor class ElectronicsSpider(CrawlSpider): name = "electronics" allowed_domains = ["www.olx.com.pk"] start_urls = [ 'https://www.olx.com.pk/computers-accessories/', 'https://www.olx.com.pk/tv-video-audio/', 'https://www.olx.com.pk/games-entertainment/' ] rules = ( Rule(LinkExtractor(allow=(), restrict_css=('.pageNextPrev',)), callback="parse_item", follow=True),) def parse_item(self, response): print('Processing..' + response.url) # print(response.text)

In order to make the crawler navigate to many page, I rather subclassed my Crawler from Crawler instead of scrapy.Spider . This class makes crawling many pages of a site easier. You can do similar with the generated code but you’ll need to take care of recursion to navigate next pages.

The next is to set rules variable, here you mention the rules of navigating the site. The LinkExtractor actually takes parameters to draw navigation boundaries. Here I am using restrict_css parameter to set the class for NEXT page. If you go to this page and inspect element you can find something like this:


Write your first web crawler in Python Scrapy

pageNextPrev is the class that be used to fetch link of next pages. The call_back parameter tells which method to use to access the page elements. We will work on this method soon.

Do remember, you need to change name of the method from parse() to parse_item() or whatever to avoid overriding the base class otherwise your rule will not work even if you set follow=True .

So far so good, let’s test the crawler I have done so far. Again, go to terminal and write:

Adnans-MBP:olxAdnanAhmad$ scrapycrawlelectronics

The 3rd parameter is actually the name of the spider which was set earlier in the name property of ElectronicsSpiders class. On console you find lots of useful information that is helpful to debug your crawler. You can disable the debugger if you don’t want to see debugging information. The command will be similar with --nolog switch.

Adnans-MBP:olxAdnanAhmad$ scrapycrawl --nologelectronics

If you run now it will print something like:

Adnans-MBP:olxAdnanAhmad$ scrapycrawl --nologelectronics Processing..https://www.olx.com.pk/computers-accessories/?page=2 Processing..https://www.olx.com.pk/tv-video-audio/?page=2 Processing..https://www.olx.com.pk/games-entertainment/?page=2 Processing..https://www.olx.com.pk/computers-accessories/ Processing..https://www.olx.com.pk/tv-video-audio/ Processing..https://www.olx.com.pk/games-entertainment/ Processing..https://www.olx.com.pk/computers-accessories/?page=3 Processing..https://www.olx.com.pk/tv-video-audio/?page=3 Processing..https://www.olx.com.pk/games-entertainment/?page=3 Processing..https://www.olx.com.pk/computers-accessories/?page=4 Processing..https://www.olx.com.pk/tv-video-audio/?page=4 Processing..https://www.olx.com.pk/games-entertainment/?page=4 Processing..https://www.olx.com.pk/computers-accessories/?page=5 Processing..https://www.olx.com.pk/tv-video-audio/?page=5 Processing..https://www.olx.com.pk/games-entertainment/?page=5 Processing..https://www.olx.com.pk/computers-accessories/?page=6 Processing..https://www.olx.com.pk/tv-video-audio/?page=6 Processing..https://www.olx.com.pk/games-entertainment/?page=6 Processing..https://www.olx.com.pk/computers-accessories/?page=7 Processing..https://www.olx.com.pk/tv-video-audio/?page=7 Processing..https://www.olx.com.pk/games-entertainment/?page=7

本文开发(python)相关术语:python基础教程 python多线程 web开发工程师 软件开发工程师 软件开发流程

主题: Python
分页:12
转载请注明
本文标题:Write your first web crawler in Python Scrapy
本站链接:http://www.codesec.net/view/530735.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 开发(python) | 评论(0) | 阅读(51)