How to use Webhose.io rated reviews for sentiment classification

[Category: Development (Python) | Posted by: 店小二03 | Date: 2017 | Author: 红领巾]

Sentiment classification is a fascinating use case for machine learning. Regardless of complexity, you need two core components to deliver meaningful results: a machine learning engine and a significant volume of structured data to train that engine.

Last month, we added the new “rating” field for rated review sites covered in the Webhose.io threaded discussions data feed. With millions of rated reviews, anyone can access high-quality structured datasets that pair a natural language string with a numerical representation of its sentiment: the familiar star rating of 1 through 5.

In this blog post, we show you how to collect your own training datasets of rated reviews and use them to train a classification model (we worked with Stanford NLP, but you can use whichever classification engine makes sense for your model). For simplicity, any review of 4 stars and above (rating:>4) is assigned a positive sentiment, while 2 and below (rating:<2) is considered negative.
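The labeling rule can be sketched as a small helper. This is only an illustration of the rule as stated in the prose (4 stars and above is positive, 2 and below is negative). The name `label_for_rating` is hypothetical and not part of the Webhose SDK, and leaving middling ratings unlabeled is an assumption; in the script itself, the filtering is done by the queries sent to webhose.io.

```python
def label_for_rating(rating):
    """Map a 1-5 star rating to a sentiment label, following the prose rule.

    Hypothetical helper for illustration; not part of the Webhose SDK.
    """
    if rating >= 4:
        return 'positive'
    if rating <= 2:
        return 'negative'
    # Assumption: middling ratings (e.g. 3 stars) are left out of the datasets
    return None


for stars in (1, 2, 3, 4, 5):
    print('%s -> %s' % (stars, label_for_rating(stars)))
```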

For our demo, we put together five datasets: two train/test pairs, each split 80% / 20%, and one additional test dataset:

General domain model training dataset (80% subset)
General domain model test dataset (remaining 20% subset)
Domain specific training dataset (80% subset)
Domain specific test dataset (remaining 20% subset)
Domain specific “blind” dataset, never introduced during training, used to run the final test
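As a rough sketch of the 80% / 20% split, the helper below cuts a list at a given fraction. The name `split_lines` and the sample data are made up for illustration; the collection script later in the post applies the same slicing idea inside its own function.

```python
def split_lines(lines, partition):
    """Split a list into (train, test); `partition` is the train fraction."""
    cut = int(len(lines) * partition)
    return lines[:cut], lines[cut:]


# Made-up sample data, just to show the proportions
reviews = ['review %d' % i for i in range(10)]
train, test = split_lines(reviews, 0.8)
print('%d train / %d test' % (len(train), len(test)))
```

Passing a partition of 0 sends every line to the test side, which is how a test-only dataset can be produced with the same helper.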

Domain specificity can dramatically improve the results of a sentiment classification engine. For example, a reference to “bugs” in a hotel review is very likely negative. However, a discussion of bugs in a software code review won’t necessarily trigger a negative signal to a sentiment classification engine.

All code samples are freely available in our Sentiment Classifier library on GitHub. Here’s what you’ll need to set it up yourself:

Terminal
Python 2.7 or above
Java 8
Webhose.io free account TOKEN for 1000 renewable monthly requests
Webhose Python SDK

1. Setup

Let’s get the basics taken care of:

Install the Webhose Python SDK

$ git clone https://github.com/Buzzilla/webhose-python
$ cd webhose-python
$ python setup.py install

Install Apache Maven and create a project template:

$ cd PROJECT_LOCATION
$ mvn archetype:generate -DgroupId=com.webhose.reviewSentiment -DartifactId=review-sentiment -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

2. Rated Review Dataset Collection

The first component of our code foundation is a Python script that uses the webhose-python SDK to collect the rated reviews that will make up our datasets.

The output of this script is a ‘resources’ directory, which will contain the train/test files for our engine.

2.1 Set the project directory via Terminal

$ cd PROJECT_LOCATION/review-sentiment

2.2 Create the Python file that will collect the training/testing data

$ touch collect_data.py

Then edit the file ‘collect_data.py’ with a text editor or an IDE:

2.2.1 The first step is to add the imports (standard library modules plus the webhose SDK) at the top of the script

from __future__ import division
import os
import re
import time
import webhose

2.2.2 Initialize the webhose SDK with your private TOKEN

WEBHOSE_API_TOKEN = 'YOUR_WEBHOSE_API_TOKEN'
webhose.config(WEBHOSE_API_TOKEN)
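If you would rather not hard-code the token in the script, one option is to read it from an environment variable. This is a minimal sketch, not something the original script does; the helper name `load_token` and the variable name `WEBHOSE_API_TOKEN` are assumptions chosen for illustration.

```python
import os


def load_token(env_var='WEBHOSE_API_TOKEN'):
    """Return the API token from the environment, or raise a clear error.

    Hypothetical helper; the environment variable name is an assumption.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError('Set the %s environment variable first' % env_var)
    return token
```

The returned value could then be passed to webhose.config() in place of the hard-coded string.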

2.2.3 Set the relative location of the train/test files

resources_dir = './src/main/resources'

2.2.4 Build the generic function that fetches the data we need from webhose.io. After fetching the data, the function creates the relevant files inside the ‘resources’ directory.

def collect(filename, query, limit, sentiment, partition):
    lines = set()
    # Collect the data from webhose.io with the given query up to the given limit
    response = webhose.search(query)
    while len(response.posts) > 0 and len(lines) < limit:
        # Go over the list of posts returned from the response
        for post in response.posts:
            # Verify that the length of the text is not too short nor too long
            if 1000 > len(post.text) > 50:
                # Extract the text from the post object and clean it
                text = re.sub(r'(\([^\)]+\)|(stars|rating)\s*:\s*\S+)\s*$', '', post.text.replace('\n', '').replace('\t', ''), 0, re.I)
                # Add the post text to the lines we are going to save in the train/test file
                lines.add(text.encode('utf8'))
        time.sleep(2)
        print 'Getting %s' % response.next
        # Request the next 100 results from webhose.io
        response = response.get_next()
    # Build the train file (first part of the returned documents)
    with open(os.path.join(resources_dir, filename + '.train'), 'a+') as train_file:
        for line in list(lines)[:int((len(lines))*partition)]:
            train_file.write('%s\t%s\n' % (sentiment, line))
    # Build the test file (rest of the returned documents)
    with open(os.path.join(resources_dir, filename + '.test'), 'a+') as test_file:
        for line in list(lines)[int((len(lines))*partition):]:
            test_file.write('%s\t%s\n' % (sentiment, line))
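To see what the cleaning step in collect() does to a review, here is a small standalone sketch that applies the same regular expression to a few made-up strings. The helper name `clean_text` and the sample texts are illustrative only.

```python
import re

# Same pattern as in collect(): a trailing "(...)" group or a trailing
# "stars: X" / "rating: X" suffix, matched case-insensitively
TRAILER = re.compile(r'(\([^\)]+\)|(stars|rating)\s*:\s*\S+)\s*$', re.I)


def clean_text(text):
    """Drop newlines and tabs, then strip a trailing parenthetical or rating suffix."""
    flat = text.replace('\n', '').replace('\t', '')
    return TRAILER.sub('', flat)


# Made-up review texts for illustration
samples = [
    'Great hotel, would stay again. Rating: 5',
    'Friendly staff\tand clean rooms (translated by Google)',
    'No trailer here, so only whitespace changes.\n',
]
for sample in samples:
    print(repr(clean_text(sample)))
```

Note that newlines and tabs are removed rather than replaced with a space, so words separated only by a tab or newline end up joined; swapping in a space is an easy adjustment if that matters for your model.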

2.2.5 Build the queries for the relevant data and create the files.

Add the ‘__main__’ section of the code. In every call to ‘collect()’, we pass the filename we want the train/test files to be called, the actual webhose.io query for the specific data, the limit on the number of lines of text we want to process and save, the sentiment class (positive/negative) for the current query, and the partition of the received data between the train and test files (80%/20% train/test split).

if __name__ == '__main__':
    # Create the resources directory if it does not exist
    if not os.path.exists(resources_dir):
        os.makedirs(resources_dir)
    # Get reviews from various sources for training and testing the general classifier, 400 lines overall,
    # split the lines 80%/20% between the general.train file and the general.test file
    collect('general', 'language:english AND rating:>4 -site:booking.com -site:expedia.*', 400, 'positive', 4/5)
    collect('general', 'language:english AND rating:<2 -site:booking.com -site:expedia.*', 400, 'negative', 4/5)
    # Get reviews from booking.com for training and testing the domain-specific classifier, 400 lines overall,
    # split the lines 80%/20% between the booking.train file and the booking.test file
    collect('booking', 'language:english AND rating:>4 AND site:booking.com', 400, 'positive', 4/5)
    collect('booking', 'language:english AND rating:<2 AND site:booking.com', 400, 'negative', 4/5)
    # Get reviews from expedia.com for later tests, 300 lines overall; all lines will be saved in expedia.test
    collect('expedia', 'language:english AND rating:>4 AND site:expedia.com', 300, 'positive', 0)