Ugly soup with Python, requests & Beautiful Soup


Web scraping has never been a coveted nor favorite discipline of mine; in fact, for me web scraping is an unfortunate, but sometimes necessary, evil. Scraping web pages, at least for me, is a very unstructured process, basically pure trial & error. Perhaps, with more experience and more interest in HTML and the web in general, it might become more likable, akin to the other types of programming I really enjoy doing… Who knows…

Anyways, I wanted to scrape some data for further analysis down the line. Normally, I'd put Pandas to heavy use for a lot of the data munging tasks, but since my neighborhood is currently experiencing the Mother of all Power outages, after the storm of the century this past Tuesday, I don't have power to run my computer; thus, this stuff is done with Pythonista (2!) on my old & tired iPad. Btw, Pythonista is a really great Python environment for i*, and I guess some day I should upgrade to Pythonista 3… But I sure miss Pandas; it makes structured data manipulation so very much more convenient than using lists or even numpy arrays…!

So, for this exercise I wanted to collect the results from a very famous long-distance cross-country ski race, the Marcialonga, which takes place each January in my favorite place on this earth, Val di Fiemme, in the Dolomites. I'd have preferred for the organisers of the race to publish the results in an easier-to-grab format than HTML, but I couldn't find anything else but the web page. Which, furthermore, splits the 5600 result entries across 56 pages… so it took a while to figure out how to scrape multiple linked pages.
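The pagination boils down to passing a `pagenum` query parameter to the results URL, one request per page. As a minimal sketch of just that part (building the 56 query URLs offline, without hitting the site):

```python
# Minimal sketch: enumerate the query URLs for the 56 result pages.
# The base URL and the 'pagenum' parameter come from the scraper below;
# here the query strings are only built, not fetched.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2 (Pythonista)

BASE = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'

urls = [BASE + '?' + urlencode({'pagenum': page}) for page in range(1, 57)]
print(len(urls))   # 56
print(urls[0])
```

In the real scraper below, requests does the same encoding itself via its `params` argument.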

Below are a couple of screenshots of an initial, very basic analysis. I'll do quite a bit more statistical analysis if and when the power grid resumes operations…


```python
# coding: utf-8
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs
import numpy as np
import matplotlib.pyplot as plt

comp_list = []

# the results are split across 56 separate pages
for page in range(1, 57):
    url = 'https://www.marcialonga.it/marcialonga_ski/EN_results.php'
    payload = {'pagenum': page}
    r = requests.get(url, params=payload)
    print(page, r.status_code)

    soup = bs(r.content, 'html.parser')
    main_table = soup.find('table')
    # each competitor entry carries class 'SP'
    competitors = main_table.find_all(class_='SP')
    for comp in competitors:
        comp_list.append(comp.get_text())

comp_list = list(map(lambda x: x.encode('utf-8'), comp_list))
print(len(comp_list))


def parse_item(i):
    # pull the individual fields out of one flattened result row
    res_pattern = r'[0-9]+'
    char_pattern = r'[A-Z]+'
    num_pattern = r'[0-9]+:[0-9]+:[0-9]+\.[0-9]'
    age_pattern = r'[0-9]+/'

    res = re.match(res_pattern, i).group()    # finishing position
    chars = re.findall(char_pattern, i, flags=re.IGNORECASE)
    nat = chars[-1][1:]                       # nationality
    nums = re.findall(num_pattern, i)
    age = re.findall(age_pattern, i)
    age_over = age[0][:-1]                    # age group

    time_pattern = r'0[0-9]:[0-9]+:[0-9]+\.[0-9]$'
    time = re.findall(time_pattern, nums[0])
    t = datetime.strptime(time[0], '%H:%M:%S.%f')
    # subtract strptime's implicit 1900-01-01 epoch -> elapsed seconds
    td = (t - datetime(1900, 1, 1)).total_seconds()
    name = chars[0] + ' ' + chars[1]
    gender = chars[-2]
    return (res, name, nat, gender, age_over, td)


results = [parse_item(comp) for comp in comp_list]

'''
results = np.array(results, dtype=[('res','i4'),('name','U100'),('nat','U3'),
                                   ('gender','U1'),('age','i4'),
                                   ('time','datetime64[us]')])
'''
gender = np.array([r[3] for r in results])
pos = np.array([r[0] for r in results])
ages = np.array([r[4] for r in results]).astype(int)
secs = np.array([r[5] for r in results])
print(ages.size)
print(secs.size)
```
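One detail from `parse_item()` worth calling out: converting a finishing time like `01:10:13.1` into seconds. `strptime()` anchors the parsed time at 1900-01-01, so subtracting that date leaves a timedelta holding just the elapsed time. A small standalone check (the sample time is made up):

```python
from datetime import datetime

# Parse an 'H:MM:SS.f' finishing time; strptime anchors the result at
# 1900-01-01, so subtracting that date leaves just the elapsed time.
t = datetime.strptime('01:10:13.1', '%H:%M:%S.%f')
seconds = (t - datetime(1900, 1, 1)).total_seconds()
print(seconds)   # 4213.1
```

Note that `%f` happily accepts the single fractional digit the results page uses, reading it as tenths of a second.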
```python
# finishing times (in seconds) of two friends, drawn as reference lines
friend_1 = 18844
friend_2 = 24446

male_mask = gender == 'M'
male_secs = secs[male_mask]
female_secs = secs[~male_mask]
male_mean = male_secs.mean()
female_mean = female_secs.mean()

bins = range(10000, 38000, 1000)

# Top panel: relative frequencies -- each skier is weighted 1/N so the
# men's and women's histograms are comparable despite different field sizes
plt.subplot(211)
plt.hist(male_secs, color='b', alpha=0.5, label='Men', bins=bins,
         weights=np.zeros_like(male_secs) + 1. / male_secs.size)
plt.hist(female_secs, color='r', alpha=0.5, label='Women', bins=bins,
         weights=np.zeros_like(female_secs) + 1. / female_secs.size)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Relative Frequency')
plt.axvline(friend_1, ls='dashed', color='cyan', label='Friend_1', lw=5)
plt.axvline(friend_2, ls='dashed', color='magenta', label='Friend_2', lw=5)
plt.axvline(male_mean, ls='dashed', color='darkblue', label='Men mean', lw=5)
plt.axvline(female_mean, ls='dashed', color='darkred', label='Women mean', lw=5)
plt.legend(loc='upper right')

print(secs.min(), secs.max())

# Bottom panel: the same distributions as raw counts
plt.subplot(212)
plt.hist(male_secs, color='b', alpha=0.5, label='Men', bins=bins)
plt.hist(female_secs, color='r', alpha=0.5, label='Women', bins=bins)
plt.title('2018 Marcialonga Ski Race - time distribution Men vs Women')
plt.xlabel('Time [seconds]')
plt.ylabel('Nr of Skiers')
plt.legend(loc='upper right')
plt.tight_layout()
plt.show()
```
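The `weights=np.zeros_like(x) + 1./x.size` trick in the top panel deserves a word: with it, each sample contributes 1/N to its bin, so the bar heights become relative frequencies summing to 1, which is what makes the men's and women's distributions comparable despite very different field sizes. A tiny illustration with made-up numbers, using `np.histogram` (the same binning machinery `plt.hist` uses):

```python
import numpy as np

x = np.array([1., 1., 2., 5.])
# each of the 4 samples gets weight 1/4, so bin heights become fractions
weights = np.zeros_like(x) + 1. / x.size
counts, edges = np.histogram(x, bins=[0, 2, 4, 6], weights=weights)
print(counts)        # relative frequencies: 0.5, 0.25, 0.25
print(counts.sum())  # sums to 1.0
```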

the end :wink:

Reposted at: https://www.codesec.net/view/628468.html

