How to use Webhose.io rated reviews for sentiment classification
Sentiment classification is a fascinating use case for machine learning. Regardless of complexity, you need two core components to deliver meaningful results: a machine learning engine and a significant volume of structured data to train that engine.
Last month, we added the new “rating” field for rated review sites covered in the Webhose.io threaded discussions data feed. With millions of rated reviews, anyone can access high-quality structured datasets that pair a natural-language string with a numerical representation of its sentiment: the familiar star rating of 1 through 5.
In this blog post, we show you how to collect your own training datasets of rated reviews and use them to train a classification model (we worked with Stanford NLP, but you can use whichever classification engine makes sense for your model). For simplicity, any review rated above 4 stars (rating:>4) is assigned a positive sentiment, while any review rated below 2 stars (rating:<2) is considered negative.
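The star-to-label mapping we just described can be sketched as a small helper (hypothetical, not part of the Webhose SDK):

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment class, mirroring the
    rating:>4 / rating:<2 queries used later in this post.
    Mid-range ratings are ambiguous, so they are left out of the
    training data entirely."""
    if rating > 4:
        return 'positive'
    if rating < 2:
        return 'negative'
    return None  # 2-4 stars: excluded from training

print(rating_to_sentiment(5))  # positive
print(rating_to_sentiment(1))  # negative
print(rating_to_sentiment(3))  # None
```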
For our demo, we put together five datasets: two pairs of train/test files, split 80%/20% respectively, plus one more test dataset:
General domain model training dataset (80% subset)
General domain model test dataset (remaining 20% subset)
Domain-specific training dataset (80% subset)
Domain-specific test dataset (remaining 20% subset)
Domain-specific “blind” dataset, never introduced during training, to run the final test
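With a limit of 400 reviews per class and a partition of 4/5, each dataset splits as shown below. Note that the script later passes the partition literally as 4/5, which is why it begins with "from __future__ import division"; without that import, Python 2 would evaluate 4/5 to 0 and send every line to the test file:

```python
from __future__ import division  # makes 4/5 evaluate to 0.8 on Python 2 as well

limit, partition = 400, 4/5
train_count = int(limit * partition)  # lines written to the .train file
test_count = limit - train_count      # lines written to the .test file
print(train_count, test_count)  # 320 80
```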
Domain specificity can dramatically improve the results of a sentiment classification engine. For example, a reference to “bugs” in a hotel review is very likely negative. However, a discussion of bugs in a software code review won’t necessarily trigger a negative signal to a sentiment classification engine.
All code samples are freely available in our Sentiment Classifier library on GitHub. Here’s what you’ll need to set it up yourself:
Terminal
Python 2.7 or above
Java 8
Webhose.io free account TOKEN for 1,000 renewable monthly requests
1. Setup
Let’s get the basics taken care of:
Install the Webhose Python SDK:
$ git clone https://github.com/Buzzilla/webhose-python
$ cd webhose-python
$ python setup.py install
Install Apache Maven and create a project template:
$ cd PROJECT_LOCATION
$ mvn archetype:generate -DgroupId=com.webhose.reviewSentiment -DartifactId=review-sentiment -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
2. Rated Review Dataset Collection
The first component of our code foundation is a python script that uses the webhose-python SDK to collect the rated reviews that will make up our datasets.
The output of this script is a ‘resources’ directory, which will contain the train/test files for our engine.
2.1 Set the project directory via Terminal:
$ cd PROJECT_LOCATION/review-sentiment
2.2 Create the Python file which will collect the training/testing data:
$ touch collect_data.py
2.3 Edit the file ‘collect_data.py’ with a text editor or an IDE:
2.3.1 The first step of the script is to cover our imports, so add these to the top of the script (the script also uses the standard os and re modules and the webhose SDK, so they are imported here as well):

from __future__ import division
import os
import re
import webhose
2.3.2 Initialize the webhose SDK with your private TOKEN:

WEBHOSE_API_TOKEN = 'YOUR_WEBHOSE_API_TOKEN'
webhose.config(token=WEBHOSE_API_TOKEN)
2.3.3 Set the relative location of the train/test files:

resources_dir = './src/main/resources'
2.3.4 Build the generic function that will get the necessary data for us from webhose.io; after getting the data, the function will create the relevant files inside the ‘resources’ directory.

def collect(filename, query, limit, sentiment, partition):
    lines = set()
    # Collect the data from webhose.io with the given query up to the given limit
    response = webhose.search(query)
    while len(response.posts) > 0 and len(lines) < limit:
        # Go over the list of posts returned from the response
        for post in response.posts:
            # Verify that the length of the text is not too short nor too long
            if 1000 > len(post.text) > 50:
                # Extract the text from the post object and clean it
                text = re.sub(r'(\([^\)]+\)|(stars|rating)\s*:\s*\S+)\s*$', '',
                              post.text.replace('\n', '').replace('\t', ''), 0, re.I)
                # Add the post text to the lines we are going to save in the train/test files
                lines.add(text.encode('utf8'))
        print 'Getting %s' % response.next
        # Request the next 100 results from webhose.io
        response = response.get_next()
    # Build the train file (first part of the returned documents)
    with open(os.path.join(resources_dir, filename + '.train'), 'a+') as train_file:
        for line in list(lines)[:int(len(lines) * partition)]:
            train_file.write('%s\t%s\n' % (sentiment, line))
    # Build the test file (rest of the returned documents)
    with open(os.path.join(resources_dir, filename + '.test'), 'a+') as test_file:
        for line in list(lines)[int(len(lines) * partition):]:
            test_file.write('%s\t%s\n' % (sentiment, line))
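The cleanup regex inside collect() strips a trailing parenthetical or a trailing "stars:"/"rating:" marker from the end of a review. Here is a quick standalone demonstration of that regex (the sample string is illustrative, not real Webhose data):

```python
import re

# Same pattern as in collect_data.py: removes a trailing "(...)"
# or a trailing "stars: ..." / "rating: ..." marker, case-insensitively.
pattern = r'(\([^\)]+\)|(stars|rating)\s*:\s*\S+)\s*$'

raw = 'Great hotel, friendly staff. Rating: 5/5'
clean = re.sub(pattern, '', raw, 0, re.I)
print(clean.strip())  # Great hotel, friendly staff.
```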
2.3.5 Build the queries for the relevant data, and create the files.
Add the ‘__main__’ section of the code. In every call to the ‘collect()’ function, we pass the filename we want the train/test files to be called, the actual query to webhose.io for the specific data, the limit of lines of text we want to process and save, the sentiment class (positive/negative) for the current query, and the partition of the received data between the train and the test file (80%/20% train/test split):

if __name__ == '__main__':
    # Create the resources directory if it does not exist
    if not os.path.exists(resources_dir):
        os.makedirs(resources_dir)
    # Get reviews from various sources for training and testing the general classifier, overall 400 lines,
    # split 80%/20% between the general.train file and the general.test file
    collect('general', 'language:english AND rating:>4 -site:booking.com -site:expedia.*', 400, 'positive', 4/5)
    collect('general', 'language:english AND rating:<2 -site:booking.com -site:expedia.*', 400, 'negative', 4/5)
    # Get reviews from booking.com for training and testing the domain-specific classifier, overall 400 lines,
    # split 80%/20% between the booking.train file and the booking.test file
    collect('booking', 'language:english AND rating:>4 AND site:booking.com', 400, 'positive', 4/5)
    collect('booking', 'language:english AND rating:<2 AND site:booking.com', 400, 'negative', 4/5)
    # Get reviews from expedia.com for a later test, overall 300 lines, all saved to expedia.test
    collect('expedia', 'language:english AND rating:>4 AND site:expedia.com', 300, 'positive', 0)
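Once the script has run, each file in the ‘resources’ directory is a simple tab-separated file with the sentiment class in the first column and the review text in the second, which is the format we feed to the classifier in the next step. A quick sketch of reading one line back (the sample line is illustrative):

```python
# Each line of a generated .train/.test file looks like: label<TAB>text
sample = 'positive\tGreat location and very friendly staff.'
sentiment, text = sample.split('\t', 1)
print(sentiment)  # positive
print(text)       # Great location and very friendly staff.
```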