Many machine learning algorithms expect data to be scaled consistently.

There are two popular methods that you should consider when scaling your data for machine learning.

In this tutorial, you will discover how you can rescale your data for machine learning. After reading this tutorial you will know:

- How to normalize your data from scratch.
- How to standardize your data from scratch.
- When to normalize as opposed to standardize your data.

Let’s get started.

How To Prepare Machine Learning Data From Scratch With Python

Photo by Ondra Chotovinsky, some rights reserved.

Description

Many machine learning algorithms expect the scale of the input and even the output data to be equivalent.

It can help in methods that weight inputs in order to make a prediction, such as in linear regression and logistic regression.

It is practically required in methods that combine weighted inputs in complex ways such as in artificial neural networks and deep learning.

In this tutorial, we are going to practice rescaling one standard machine learning dataset in CSV format.

Specifically, the Pima Indians diabetes dataset. It contains 768 rows and 9 columns. All of the values in the file are numeric, specifically floating point values. We will first load the file, then convert the loaded strings to numeric values.

Tutorial

This tutorial is divided into 3 parts:

1. Normalize Data.
2. Standardize Data.
3. When to Normalize and Standardize.

These steps will provide the foundations you need to handle scaling your own data.

1. Normalize Data

Normalization can refer to different techniques depending on context.

Here, we use normalization to refer to rescaling an input variable to the range between 0 and 1.

Normalization requires that you know the minimum and maximum values for each attribute.

This can be estimated from training data or specified directly if you have deep knowledge of the problem domain.
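For example, when domain knowledge fixes an attribute's range, the min and max can be specified directly rather than estimated. A minimal sketch, using a made-up column of 8-bit pixel intensities (which always span 0 to 255) for illustration:

```python
# When domain knowledge fixes an attribute's range (e.g. 8-bit
# pixel intensities always span 0 to 255), the min and max can
# be specified directly instead of estimated from the data.
pixel_min, pixel_max = 0, 255
pixels = [0, 64, 128, 255]
scaled = [(p - pixel_min) / (pixel_max - pixel_min) for p in pixels]
print(scaled)  # values rescaled into the range 0 to 1
```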

You can easily estimate the minimum and maximum values for each attribute in a dataset by enumerating through the values.

The snippet of code below defines the dataset_minmax() function that calculates the min and max value for each attribute in a dataset, then returns a list of these values, one [min, max] pair per column.

```python
# Find the min and max values for each column
def dataset_minmax(dataset):
	minmax = list()
	for i in range(len(dataset[0])):
		col_values = [row[i] for row in dataset]
		value_min = min(col_values)
		value_max = max(col_values)
		minmax.append([value_min, value_max])
	return minmax
```

We can contrive a small dataset for testing as follows:

x1	x2
50	30
20	90

With this contrived dataset, we can test our function for calculating the min and max for each column.

```python
# Find the min and max values for each column
def dataset_minmax(dataset):
	minmax = list()
	for i in range(len(dataset[0])):
		col_values = [row[i] for row in dataset]
		value_min = min(col_values)
		value_max = max(col_values)
		minmax.append([value_min, value_max])
	return minmax

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)
```

Running the example produces the following output.

First, the dataset is printed in a list of lists format, then the min and max for each column are printed as one [min, max] pair per column.

For example:

[[50, 30], [20, 90]]
[[20, 50], [30, 90]]

Once we have estimates of the minimum and maximum allowed values for each column, we can normalize the raw data to the range 0 to 1.

The calculation to normalize a single value for a column is:

scaled_value = (value - min) / (max - min)
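Plugging in the first value of column x1 from the contrived dataset above (value 50, with an observed min of 20 and max of 50) shows the arithmetic:

```python
# Worked example of the normalization formula using the first
# value of column x1 from the contrived dataset: value 50,
# observed min 20, observed max 50.
value, col_min, col_max = 50, 20, 50
scaled_value = (value - col_min) / (col_max - col_min)
print(scaled_value)  # 1.0, because 50 is the column maximum
```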

Below is an implementation of this in a function called normalize_dataset() that normalizes values in each column of a provided dataset.

```python
# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
	for row in dataset:
		for i in range(len(row)):
			row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
```

We can tie this function together with the dataset_minmax() function and normalize the contrived dataset.

```python
# Find the min and max values for each column
def dataset_minmax(dataset):
	minmax = list()
	for i in range(len(dataset[0])):
		col_values = [row[i] for row in dataset]
		value_min = min(col_values)
		value_max = max(col_values)
		minmax.append([value_min, value_max])
	return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
	for row in dataset:
		for i in range(len(row)):
			row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Contrive small dataset
dataset = [[50, 30], [20, 90]]
print(dataset)

# Calculate min and max for each column
minmax = dataset_minmax(dataset)
print(minmax)

# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset)
```

Running this example prints the output below, including the normalized dataset.

[[50, 30], [20, 90]]
[[20, 50], [30, 90]]
[[1.0, 0.0], [0.0, 1.0]]

We can combine this code with code for loading a CSV file to load and normalize the Pima Indians diabetes dataset.

Download the Pima Indians dataset from the UCI Machine Learning Repository and place it in your current directory with the name pima-indians-diabetes.csv . Open the file and delete any empty lines at the bottom.
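Rather than editing the file by hand, the blank lines can also be skipped while loading. Below is a sketch of a load_csv() variant that drops empty rows (an assumption of convenience; the version used in the full example below expects a clean file):

```python
from csv import reader

# Load a CSV file, skipping any blank rows such as the
# empty lines sometimes found at the bottom of the file.
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		for row in reader(file):
			if not row:
				continue  # skip empty lines
			dataset.append(row)
	return dataset
```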

The example first loads the dataset and converts the values for each column from string to floating point values. The minimum and maximum values for each column are estimated from the dataset, and finally, the values in the dataset are normalized.

```python
from csv import reader

# Load a CSV file
def load_csv(filename):
	dataset = list()
	with open(filename, 'r') as file:
		for row in reader(file):
			dataset.append(row)
	return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
	for row in dataset:
		row[column] = float(row[column].strip())

# Find the min and max values for each column
def dataset_minmax(dataset):
	minmax = list()
	for i in range(len(dataset[0])):
		col_values = [row[i] for row in dataset]
		value_min = min(col_values)
		value_max = max(col_values)
		minmax.append([value_min, value_max])
	return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
	for row in dataset:
		for i in range(len(row)):
			row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Load the pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

# Convert string columns to float
for i in range(len(dataset[0])):
	str_column_to_float(dataset, i)
print(dataset[0])

# Calculate min and max for each column
minmax = dataset_minmax(dataset)

# Normalize columns
normalize_dataset(dataset, minmax)
print(dataset[0])
```

Running the example produces the output below.

The first record from the dataset is printed before and after normalization, showing the effect of the scaling.
