
How to Load Machine Learning Data From Scratch In Python

[Category: Development (Python) | Posted by 店小二03 | Date: 2016 | Author: 红领巾]

You must know how to load data before you can use it to train a machine learning model.

When starting out, it is a good idea to stick with small in-memory datasets using standard file formats like comma separated value (.csv).

In this tutorial you will discover how to load your data in Python from scratch, including:

How to load a CSV file.
How to convert strings from a file to floating point numbers.
How to convert class values from a file to integers.

Let’s get started.



Description

Comma Separated Values

The standard file format for small datasets is Comma Separated Values or CSV.

In its simplest form, a CSV file is composed of rows of data. Each row is divided into columns using a comma (“,”).
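As a small illustration of that row and column structure, the sketch below parses three comma-separated lines held in a string rather than in a file. The sample text and variable names are made up for this example; csv.reader and io.StringIO come from the standard library.

```python
from csv import reader
from io import StringIO

# Sample CSV text (made up for illustration): three rows, three columns
text = "1,2.5,a\n3,4.5,b\n5,6.5,c\n"

# csv.reader splits each line on commas and yields one list of strings per row
rows = list(reader(StringIO(text)))

for row in rows:
    print(row)
```

Every value comes back as a string, including the ones that look like numbers.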

You can learn more about the CSV file format in RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.

In this tutorial, we are going to practice loading two different standard machine learning datasets in CSV format.

Pima Indians Diabetes Dataset

The first is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns.

All of the values in the file are numeric, specifically floating point values. We will learn how to load the file first, then later how to convert the loaded strings to numeric values.

You can learn more about this dataset on the UCI Machine Learning Repository.

Iris Flower Species Dataset

The second dataset we will work with is the iris flowers dataset.

It contains 150 rows and 5 columns. The first 4 columns are numeric. It is different in that the class value (final column) is a string, indicating a species of flower. We will learn how to convert the numeric columns from strings to numbers and how to convert the flower species string into an integer that we can use consistently.

You can learn more about this dataset on the UCI Machine Learning Repository.

Tutorial

This tutorial is divided into 3 parts:

Load a file.
Load a file and convert Strings to Floats.
Load a file and convert Strings to Integers.

These steps will provide the foundations you need to handle loading your own data.

1. Load CSV File

The first step is to load the CSV file.

We will use the csv module that is a part of the standard library.

The reader() function in the csv module takes a file as an argument.

We will create a function called load_csv() to wrap this behavior. It takes a filename and returns our dataset. We will represent the loaded dataset as a list of lists: the outer list holds the observations (rows), and each inner list holds the column values for one row.

Below is the complete function for loading a CSV file.

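    from csv import reader

    # Load a CSV file into a list of rows (each row is a list of strings)
    def load_csv(filename):
        # In Python 3 the csv module expects a text-mode file opened with newline=''
        with open(filename, "r", newline="") as file:
            dataset = list(reader(file))
        return dataset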

We can test this function by loading the Pima Indians dataset. Download the dataset and place it in the current working directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.
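Empty lines can also be skipped while loading instead of being deleted by hand. The sketch below is one possible variant, not the function used in the rest of this tutorial: the name load_csv_skip_empty() is hypothetical and Python 3 is assumed. It writes a tiny temporary file so that it can run on its own.

```python
import os
import tempfile
from csv import reader

# Hypothetical variant of load_csv() that ignores empty rows
def load_csv_skip_empty(filename):
    dataset = []
    with open(filename, "r", newline="") as file:
        for row in reader(file):
            # csv.reader yields an empty list for a blank line
            if not row:
                continue
            dataset.append(row)
    return dataset

# Write a tiny sample file with blank lines at the bottom, then load it
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("6,148,72\n1,85,66\n\n\n")
    sample_path = tmp.name

sample = load_csv_skip_empty(sample_path)
os.remove(sample_path)
print(len(sample), "rows loaded")
```

With this variant the trailing blank lines never reach the dataset, so later code that indexes into each row does not fail on an empty list.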

Taking a peek at the first 5 rows of the raw data file we can see the following:

    6,148,72,35,0,33.6,0.627,50,1
    1,85,66,29,0,26.6,0.351,31,0
    8,183,64,0,0,23.3,0.672,32,1
    1,89,66,23,94,28.1,0.167,21,0
    0,137,40,35,168,43.1,2.288,33,1

The data is numeric and separated by commas, and we can expect the rest of the file to follow the same pattern.

Let’s use the new function and load the dataset. Once loaded we can report some simple details such as the number of rows and columns loaded.

Putting all of this together, we get the following:

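    from csv import reader

    # Load a CSV file into a list of rows (each row is a list of strings)
    def load_csv(filename):
        # In Python 3 the csv module expects a text-mode file opened with newline=''
        with open(filename, "r", newline="") as file:
            dataset = list(reader(file))
        return dataset

    # Load dataset
    filename = 'pima-indians-diabetes.csv'
    dataset = load_csv(filename)
    # Call format() on the string itself, not on the result of print()
    print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))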

Running this example we see:

    Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns

2. Convert String to Floats

Most, if not all, machine learning algorithms prefer to work with numbers.

Specifically, floating point numbers are preferred.

Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:

    print(dataset[0])

This produces output like:

    ['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']

We can write a small function to convert specific columns of our loaded dataset to floating point values.

Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values in place, taking care to strip any whitespace from each value before making the conversion.

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())
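A quick way to try the function without a file is to run it on a small in-memory dataset. The two rows below are made up for illustration and include stray whitespace around some values.

```python
# Convert one column of string values to floats, in place
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Two made-up rows with stray whitespace around some values
rows = [[" 6", "33.6 "], ["1", " 26.6"]]

# Convert every column, one at a time
for i in range(len(rows[0])):
    str_column_to_float(rows, i)

print(rows)
```

Because the function modifies the rows in place, it returns nothing and the original list holds the converted values afterwards.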

We can test this function by combining it with our load CSV function above, and convert all of the numeric data in the Pima Indians dataset to floating point values.

The complete example is below.

    from csv import reader

    # Load a CSV file into a list of rows (each row is a list of strings)
    def load_csv(filename):
        # In Python 3 the csv module expects a text-mode file opened with newline=''
        with open(filename, "r", newline="") as file:
            dataset = list(reader(file))
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Load pima-indians-diabetes dataset
    filename = 'pima-indians-diabetes.csv'
    dataset = load_csv(filename)
    print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
    print(dataset[0])

    # Convert string columns to float
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    print(dataset[0])

Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.

    Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
    ['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
    [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]

3. Convert String to Integers

The iris flowers dataset is like the Pima Indians dataset, in that the columns contain numeric data.

The difference is the final column, traditionally used to hold the outcome or value to be predicted for a given row. The final column in the iris flowers data is the iris flower species as a string.

Download the dataset and place it in the current working directory with the file name iris.csv. Open the file and delete any empty lines at the bottom.

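For example, below are the first 5 rows of the raw dataset.

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa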
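The species column can then be mapped to integers. The sketch below shows one way this might be done, working on a few inline sample rows in the iris layout (four measurements followed by a species name) rather than on the file itself. The helper name str_column_to_int() and the choice to assign codes in alphabetical order of the species names are assumptions made for illustration.

```python
# Hypothetical helper: map each distinct string in a column to an integer
def str_column_to_int(dataset, column):
    # Sort the unique values so the mapping is the same on every run
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    # Replace each string with its integer code, in place
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Sample rows in the iris layout: four measurements, then the species
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["7.0", "3.2", "4.7", "1.4", "Iris-versicolor"],
    ["6.3", "3.3", "6.0", "2.5", "Iris-virginica"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
]

# The class value is the final column
lookup = str_column_to_int(rows, len(rows[0]) - 1)
print(lookup)
print(rows[0])
```

Returning the lookup dictionary keeps a record of which integer stands for which species, so predictions can be translated back to names later. The four measurement columns are still strings here and would be converted separately with a float conversion.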
