Long I wanted to test a simple data manipulation tasks and compare the execution time, ease of writing the code and simplicity between T-SQL and R package for data manipulation.

Couple of packages I will mention for data manipulations are plyr, dplyr and data.table and compare the execution time, simplicity and ease of writing with general T-SQL code and RevoScaleR package. For this blog post I will use R package dplyr and T-SQL with possibilites of RevoScaleR computation functions.

My initial query will be. Available in WideWorldImportersDW database. No other alterations have been done to underlying tables (fact.sale or dimension.city).

USE WideWorldImportersDW; GO SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN [Dimension].[City] AS c ON c.[City Key] = fs.[City Key];

For the metrics of comparison I will be using simple time statistics (metrics:elapsed time) and every code will be run from SSMS.

-- Example SET STATISTICS TIME ON SELECT GETDATE() SET STATISTICS TIME OFF SELECT

First simple test is just printing the data.

PRINT ' ' PRINT 'STATISTICS WITH T-SQL' PRINT ' ' SET STATISTICS TIME ON SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key] SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N'OutputDataSet <- InputDataSet' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATE ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF

This test is very much simple. Outputting the data. In case to T-SQL it is just a simple SELECT statement and in case of R-TSQL is just passing data to R Engine and returning in a data-frame. The test itself is trivial and “absurd” but very important to get the base-line and time related results.


Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...

Running the query couple of times on my client, I have some base comparison. T-SQL outperforms R-TSQL and it is almost two-times faster. I would expect higher difference in terms of time, because it involves a lot of data migration and no computations.

Filtering / WHERE

Now let’s perform a simple WHERE clause or some simple filtering.

PRINT ' ' PRINT 'STATISTICS WITH T-SQL' PRINT ' ' -- SIMPLE T-SQL SET STATISTICS TIME ON SELECT fs.[City Key] ,c.City ,c.[Sales Territory] ,fs.[Customer Key] ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key] WHERE [Sales Territory] <> 'Southeast' AND fs.[Customer Key] <> 0 AND Profit <= 1000 SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R' PRINT ' ' -- R Package dplyr and T-SQL SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N' library(dplyr) OutputDataSet <- InputDataSet %>% filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATETIME ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF
Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...

Again results of both executions are the same but difference is getting bigger and bigger. Main reason given is that all the data must be transferred to R engine, before being manipulated and results returned.

Now I can perform a simple comparison with dplyr package and RevoscaleR package. In this manner we will have same ground base for comparison.

PRINT ' ' PRINT 'STATISTICS WITH R dpylr' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL AS NVARCHAR(MAX) SET @TSQL = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLARE @RScript AS NVARCHAR(MAX) SET @RScript = N' library(dplyr) OutputDataSet <- InputDataSet %>% filter(SalesTerritory != "Southeast", CustomerKey != 0, Profit <= 1000)' EXEC sys.sp_execute_external_script @language = N'R' ,@script = @RScript ,@input_data_1 = @TSQL WITH RESULT SETS (( [City Key] INT ,[City] NVARCHAR(200) ,[Sales Territory] NVARCHAR(200) ,[Customer Key] INT ,[Invoice Date Key] DATETIME ,[Quantity] INT ,[Unit Price] DECIMAL(18,3) ,[Tax Amount] DECIMAL(18,3) ,[Total Including Tax] DECIMAL(18,2) ,[Profit] DECIMAL(18,2) )); SET STATISTICS TIME OFF PRINT ' ' PRINT 'STATISTICS WITH R RevoScaleR' PRINT ' ' SET STATISTICS TIME ON DECLARE @TSQL1 AS NVARCHAR(MAX) SET @TSQL1 = N'SELECT fs.[City Key] ,c.City ,c.[Sales Territory] AS SalesTerritory ,fs.[Customer Key] AS CustomerKey ,fs.[Invoice Date Key] ,fs.[Quantity] ,fs.[Unit Price] ,fs.[Tax Amount] ,fs.[Total Including Tax] ,fs.Profit FROM [Fact].[Sale] AS fs JOIN dimension.city AS c ON c.[City Key] = fs.[City Key]' DECLA

本文数据库(mssql)相关术语:熊片数据库 mssql数据库 oracle数据库 pubmed数据库 access数据库 万方数据库

主题: SQLTISU
分页:12
转载请注明
本文标题:Comparing performance on dplyr package, RevoScaleR package and T-SQL on simple d ...
本站链接:http://www.codesec.net/view/481551.html
分享请点击:


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(mssql) | 评论(0) | 阅读(32)