
Asynchronous Scraping with Python


Previously, I've written about the basics of scraping and how you can find API calls in order to fetch data that isn't easily downloadable.

For simplicity, the code in these posts has always been synchronous -- given a list of URLs, we process one, then the next, then the next, and so on. While this makes for code that's straightforward, it can also be slow.

This doesn't have to be the case though. Scraping is often an example of code that is embarrassingly parallel. With some slight changes, our tasks can be done asynchronously, allowing us to process more than one URL at a time.

In version 3.2, Python introduced the concurrent.futures module, which is a joy to use for parallelizing tasks like scraping. The rest of this post will show how we can use the module to make our previously synchronous code asynchronous.

Parallelizing your tasks

Imagine we have a list of several thousand URLs. In previous posts, we've always written something that looks like this:

from csv import DictWriter

URLS = [ ... ]  # thousands of urls for pages we'd like to parse

def parse(url):
    # our logic for parsing the page
    return data  # probably a dict

results = []

for url in URLS:  # go through each url one by one
    results.append(parse(url))

with open('results.csv', 'w') as f:
    writer = DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)

The above is an example of synchronous code -- we're looping through a list of URLs, processing one at a time. If the list of URLs is relatively small or we're not concerned about execution time, there's little reason to parallelize these tasks -- we might as well keep things simple and wait it out.

However, sometimes we have a huge list of URLs -- at least several thousand -- and we can't wait hours for them to finish.

With concurrent.futures, we can work on multiple URLs at once by adding a ProcessPoolExecutor and making a slight change to how we fetch our results.

But first, a reminder: if you're scraping, don't be a jerk. Space out your requests appropriately and don't hammer the site (i.e. use time.sleep to wait briefly between each request and set max_workers to a small number). Being a jerk runs the risk of getting your IP address blocked -- good luck getting that data now.
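For example, a throttled parse might look something like the sketch below. This is a minimal, hypothetical version assuming the requests library; the real parsing logic depends on your pages, and note that the sleep throttles each worker individually rather than the pool as a whole.

import time
import requests

def parse(url):
    response = requests.get(url)
    # stand-in for real parsing logic -- pull whatever fields you need
    data = {'url': url, 'status': response.status_code}
    time.sleep(1)  # wait briefly between requests so we don't hammer the site
    return data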

from concurrent.futures import ProcessPoolExecutor
import concurrent.futures

URLS = [ ... ]

def parse(url):
    # our logic for parsing the page
    return data  # still probably a dict

with ProcessPoolExecutor(max_workers=4) as executor:
    future_results = {executor.submit(parse, url): url for url in URLS}

    results = []

    for future in concurrent.futures.as_completed(future_results):
        results.append(future.result())

In the above code, we're submitting tasks to the executor -- four workers -- each of which will execute the parse function against a URL. This execution does not happen immediately. For each submission, the executor returns an instance of a Future, which tells us that our task will be executed at some point in the ... well, future. The as_completed function watches our future_results for completion, upon which we'll be able to fetch each result via the result method.

My favorite part about this module is the clarity of its API -- tasks are submitted to an executor, which is made up of one or more workers, each of which is churning through our tasks. Because our tasks are executed asynchronously, we are not waiting for a given task's completion before submitting another -- we are doing so at-will, with completion happening in the future. Once completed, we can get the task's result.
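One detail the example above glosses over: the dict comprehension maps each Future back to its URL, which comes in handy because result will re-raise any exception that parse raised. Here's a sketch of how you might use that mapping to log failures and write the CSV just like the synchronous version -- it assumes the parse and future_results from the code above.

from csv import DictWriter

results = []

for future in concurrent.futures.as_completed(future_results):
    url = future_results[future]  # recover the url this future was responsible for
    try:
        results.append(future.result())  # result() re-raises any exception from parse
    except Exception as e:
        print('failed to parse {}: {}'.format(url, e))

with open('results.csv', 'w') as f:
    writer = DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)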

Closing up

With a few changes to your code and some concurrent.futures love, you no longer have to fetch those basketball stats one page at a time.

But don't be a jerk either.
