-- Zack de la Rocha
tl;dr -> I collected an implicit feedback dataset along with side-information about the items. This dataset contains around 62,000 users and 28,000 items. All the data lives here inside of this repo. Enjoy!
In a previouspost, I wrote about how to use matrix factorization and explicit feedback data in order to build recommendation systems. This is data where a user has given a clear preference for an item such as a star rating for an Amazon product or a numerical rating for a movie like in the MovieLens data. A natural next step is to discuss recommendation systems for implicit feedback which is data where a user has shown a preference for an item like "number of minutes listened" for a song on Spotify or "number of times clicked" for a product on a website.
Implicit feedback-based techniques likely consitute the majority of modern recommender systems. When I set out to write a post on these techniques, I found it difficult to find suitable data. This makes sense - most companies are loathe to share users' click or usage data (and for good reasons). A cursory google search revealed a couple datasets that people use, but I kept finding issues with these datasets. For example, the million song database was shown to have some issues with data quality, while many other people just repurposed the MovieLens or Netflix data as though it was implicit (which it is not).
This started to feel like one of those "fuck it, I'll do it myself" things. And so I did.
All code for collecting this data is located on my github . The actual collected data lives in this repo, as well.Sketchfab
Back when I was a graduate student, I thought for some time that maybe I would work in the hardware space (or at a museum, or the government, or a gazillion other things). I wanted to have public, digital proof of my ( shitty ) CAD skills, and I stumbled upon Sketchfab , a website which allows you to share 3D renderings that anybody else with a browser can rotate, zoom, or watch animate. It's kind of like YouTube for 3D (and now VR!).
Liopleurodon Ferox Swim Cycle by Kyan0s on Sketchfab
Users can "like" 3D models which is an excellent implicit signal. It turns out you can actually see which user liked which model. This presumably allows one to reconstruct the classic recommendation system "ratings matrix" of users as rows and 3D models as columns with likes as the elements in the sparse matrix.
Okay, I can see the likes on the website, but how do I actually get the data?Crawling with Selenium
When I was at Insight Data Science , I built an ugly script to scrape a tutoring website. This was relatively easy. The site was largely static, so I used BeautifulSoup to simply parse through the HTML.
To get up and running with Selenium, you must first download a driver to run your browser. I went here to get a Chrome driver. The python Selenium package can then be installed using anaconda on the conda-forge channel:conda install --channel https://conda.anaconda.org/conda-forge selenium
Opening a browser window with Selenium is quite simple:In:
from selenium import webdriver chromedriver = '/path/to/chromedriver' BROWSER = webdriver.Chrome(chromedriver)
Now we must decide where to point the browser.
Sketchfab has over 1 Million 3D models and more than 600,000 users. However, not every user has liked a model, and not every model has been liked by a user. I decided to limit my search to models that had been liked by at least 5 users. To start my crawling, I went to the "all" page for popular models (sorted by number of likes, descending) and started crawling from the top.In:
Upon opening the main models page, you can open the chrome developer tools (ctrl-shift-i in linux) to reveal the HTML structure of the page. This looks like the following (click to view full-size):
Looking through the HTML reveals that all of the displayed 3D models are housed in a <div> of class infinite-grid . Each 3D model is inside of a <li> element with class item . One can grab the list of all these list elements as follows:In: elem = BROWSER.find_element_by_xpath("//div[@class='infinite-grid']") item_list = elem.find_elements_by_xpath(".//li[@class='item'
本文开发（python）相关术语:python基础教程 python多线程 web开发工程师 软件开发工程师 软件开发流程