
Likes Out! Guerilla Dataset!


-- Zack de la Rocha

tl;dr -> I collected an implicit feedback dataset along with side information about the items. The dataset contains around 62,000 users and 28,000 items. All of the data lives in this repo. Enjoy!

In a previous post, I wrote about how to use matrix factorization and explicit feedback data in order to build recommendation systems. This is data where a user has given a clear preference for an item, such as a star rating for an Amazon product or a numerical rating for a movie as in the MovieLens data. A natural next step is to discuss recommendation systems for implicit feedback, which is data where a user has shown a preference for an item, like "number of minutes listened" for a song on Spotify or "number of times clicked" for a product on a website.

Implicit feedback-based techniques likely constitute the majority of modern recommender systems. When I set out to write a post on these techniques, I found it difficult to find suitable data. This makes sense - most companies are loath to share users' click or usage data (and for good reason). A cursory Google search revealed a couple of datasets that people use, but I kept finding issues with them. For example, the Million Song Dataset was shown to have some data quality issues, while many other people simply repurposed the MovieLens or Netflix data as though it were implicit (which it is not).

This started to feel like one of those "fuck it, I'll do it myself" things. And so I did.

All code for collecting this data is located on my GitHub. The actual collected data lives in this repo as well.

Sketchfab

Back when I was a graduate student, I thought for some time that maybe I would work in the hardware space (or at a museum, or the government, or a gazillion other things). I wanted to have public, digital proof of my (shitty) CAD skills, and I stumbled upon Sketchfab, a website which allows you to share 3D renderings that anybody else with a browser can rotate, zoom, or watch animate. It's kind of like YouTube for 3D (and now VR!).

Liopleurodon Ferox Swim Cycle by Kyan0s on Sketchfab

Users can "like" 3D models, which is an excellent implicit signal. It turns out you can actually see which user liked which model. This presumably allows one to reconstruct the classic recommendation system "ratings matrix", with users as rows, 3D models as columns, and likes as the elements of the sparse matrix.
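To make the "ratings matrix" idea concrete, here is a minimal sketch of building a binary likes matrix from (user, model) pairs. The user names and model IDs are made up for illustration; a real 62,000 x 28,000 matrix would of course be stored sparsely (e.g. as a scipy.sparse.csr_matrix) rather than as dense lists.

```python
# Hypothetical (user, model) like events for illustration only.
likes = [
    ("alice", "m1"), ("alice", "m2"),
    ("bob",   "m2"), ("carol", "m3"),
]

users = sorted({u for u, _ in likes})
models = sorted({m for _, m in likes})
u_idx = {u: i for i, u in enumerate(users)}
m_idx = {m: j for j, m in enumerate(models)}

# Dense for clarity: rows are users, columns are models,
# and a 1 marks "this user liked this model".
matrix = [[0] * len(models) for _ in users]
for u, m in likes:
    matrix[u_idx[u]][m_idx[m]] = 1
```

With implicit feedback there is no explicit "dislike"; a 0 only means the interaction was never observed, which is exactly what implicit-feedback models must account for.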

Okay, I can see the likes on the website, but how do I actually get the data?

Crawling with Selenium

When I was at Insight Data Science, I built an ugly script to scrape a tutoring website. This was relatively easy. The site was largely static, so I used BeautifulSoup to simply parse through the HTML.

Sketchfab is a more modern site with extensive JavaScript. One must wait for the JavaScript to render the HTML before parsing through it. One method of automating this is to use Selenium. This software essentially lets you write code to drive an actual web browser.

To get up and running with Selenium, you must first download a driver to run your browser. I went here to get a Chrome driver. The Python Selenium package can then be installed using Anaconda via the conda-forge channel:

conda install --channel https://conda.anaconda.org/conda-forge selenium

Opening a browser window with Selenium is quite simple:

In[]:

from selenium import webdriver

chromedriver = '/path/to/chromedriver'
BROWSER = webdriver.Chrome(chromedriver)

Now we must decide where to point the browser.

Sketchfab has over 1 million 3D models and more than 600,000 users. However, not every user has liked a model, and not every model has been liked by a user. I decided to limit my search to models that had been liked by at least 5 users. To start my crawl, I went to the "all" page for popular models (sorted by number of likes, descending) and started crawling from the top.

In[]:

BROWSER.get('https://sketchfab.com/models?sort_by=-likeCount&page=1')
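To crawl beyond the first page, one can iterate over the page query parameter visible in the URL above. A small sketch, assuming the page parameter simply increments (the real crawl would feed each URL to BROWSER.get and wait for the JavaScript to render):

```python
# URL pattern taken from the listing URL above; only the page number varies.
BASE = 'https://sketchfab.com/models?sort_by=-likeCount&page={}'

def page_urls(n_pages):
    """Yield the first n_pages model-listing URLs, most-liked first."""
    for page in range(1, n_pages + 1):
        yield BASE.format(page)

urls = list(page_urls(3))
```

Since the listing is sorted by like count descending, crawling pages in order visits the most-liked models first, which pairs naturally with the at-least-5-likes cutoff.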

Upon opening the main models page, you can open the Chrome developer tools (Ctrl+Shift+I on Linux) to reveal the HTML structure of the page, which looks like the following:

[Screenshot: the models page with the Chrome developer tools open, showing the page's HTML structure]

Looking through the HTML reveals that all of the displayed 3D models are housed in a <div> of class infinite-grid . Each 3D model is inside of a <li> element with class item . One can grab the list of all these list elements as follows:

In[]:

elem = BROWSER.find_element_by_xpath("//div[@class='infinite-grid']")
item_list = elem.find_elements_by_xpath(".//li[@class='item']")
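The same XPath predicates can be exercised offline, without a browser, against a minimal stand-in for the structure described above. This snippet is a toy (the real page is far richer, and the data-uid attribute here is only a placeholder for whatever per-model attributes the page carries), but the class-based selection matches what the Selenium calls rely on:

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the grid structure described above.
html = """
<div class="infinite-grid">
  <li class="item" data-uid="aaa"></li>
  <li class="item" data-uid="bbb"></li>
</div>
"""

root = ET.fromstring(html)
# ElementTree supports the same [@class='...'] predicate used with Selenium.
items = root.findall(".//li[@class='item']")
uids = [li.get("data-uid") for li in items]
```

Testing selectors this way is a cheap sanity check before pointing them at a live, JavaScript-rendered page.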
