
How MariaDB ColumnStore Handles Big Data Workloads Query Engine


MariaDB ColumnStore is a massively parallel, scale-out columnar database. Query execution behaves quite differently from how a row-based engine works. This article outlines how queries are executed and optimized, and how performance can be influenced.

Query Flow

Applications interact with MariaDB ColumnStore using SQL, over the standard MariaDB connectors to the MariaDB server. For ColumnStore tables, the query is routed through the ColumnStore storage engine.
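A table is handled this way simply because it was created with the ColumnStore engine. As a minimal sketch (the table and columns here are illustrative, modeled on the example query later in this article rather than taken from it), any query against such a table follows the flow described in this section:

-- Illustrative table definition; ENGINE=ColumnStore is what routes
-- queries on this table through the ColumnStore storage engine and on
-- to the ExeMgr process rather than through a row-based engine.
CREATE TABLE orders (
  o_orderkey   BIGINT,
  o_custkey    INT,
  o_orderDATE  DATE,
  o_totalprice DECIMAL(12,2)
) ENGINE=ColumnStore;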

The ColumnStore storage engine converts the parsed query into a ColumnStore-specific format, which is then passed to the ExeMgr process for actual execution. The ExeMgr in turn returns the query results once execution completes.

The ExeMgr process converts the query request into a batch of primitive data jobs that perform the actual query. This is where ColumnStore performs query optimization so that the query runs as efficiently as possible. The MariaDB server optimizer is mostly bypassed, since its goals are driven by optimizing for row-based storage.

The primitive job steps are sent down to the PrimProc process running on each PM server. Filtering, joins, aggregates, and group by operations are pushed down to enable scale-out performance across multiple nodes.

Each primitive data job is generally a small, discrete task that runs in a fraction of a second. For filtering or projection jobs, the system reads data in parallel and uses additional cores to process blocks in parallel to produce the result.

Results are returned from PrimProc to the ExeMgr process. Once results have been returned from each PM, the final stage of aggregation is performed to produce the combined result. Window functions are applied at the ExeMgr level due to the need for sorted results, which may ultimately come from different PM nodes.

In the MariaDB server, any ORDER BY clause and select-list functions are applied before the results are returned to the client application.

Query Optimization

The ColumnStore optimizer makes use of table statistics including table size and extent map information to optimize query execution.

If a query involves joins, table statistics are used to predict which table will have the largest results, and this table is made the driving table. Both table size and the extent minimum and maximum values are used in this calculation. The other tables are queried by the ExeMgr and the results passed down to the PM nodes for hash joining with the driving table.

WHERE clause filters are examined in conjunction with the extent map minimum and maximum values to determine which extents for the column actually need to be scanned, drastically reducing the number of extents that must be read. This tends to work particularly well for sorted data, such as the date on which an event happened.
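As a hedged illustration (using the same illustrative orders table sketched earlier, which mirrors the example query below), a range filter on a date column lets ColumnStore skip every extent whose stored minimum and maximum fall entirely outside the range:

-- Only extents of o_orderDATE whose min/max values overlap Q4 2016
-- need to be scanned; when data was loaded roughly in date order,
-- most other extents are eliminated without any disk I/O.
select count(*)
from orders
where o_orderDATE >= '2016-10-01'
  and o_orderDATE < '2017-01-01';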

Table projection involves first executing any column filters and then projecting the minimum set of other columns necessary to satisfy the query. Column projection is an efficient operation due to the use of fixed-length datatypes.
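For example (again using the illustrative orders table rather than a definition from the original article), listing only the columns the query actually needs keeps the projection to that minimum set:

-- The o_totalprice filter is executed first; only o_custkey and
-- o_totalprice are then projected, and no other column is read.
select o_custkey, o_totalprice
from orders
where o_totalprice > 100000;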

Query Scale Out

A ColumnStore deployment with multiple PM nodes provides scale-out query performance. ColumnStore automatically distributes data evenly across the nodes, which ensures that each PM node is responsible for a smaller portion of the total data set. Having multiple PM nodes allows more CPU, network, and disk I/O to be applied, providing for:

- Reduction in query time; for example, going from 2 nodes to 4 nodes returns results in roughly half the time.
- Maintenance of query response time as your data set grows.

If a time dimension or column is utilized and the data is loaded in order or near order, extent map elimination allows for a dramatic reduction in disk I/O when data is filtered by that column.

Example Query

To better illustrate query execution within MariaDB ColumnStore, consider the following query which produces a report of total order sum by market segment for nation 24 (US) in calendar Q4 of 2016:

select c.c_mktsegment cust_mkt_segment,
       sum(o.o_totalprice) total_order_amount
from orders o
join customer c on o.o_custkey = c.c_custkey
where c.c_nationkey = 24
and o.o_orderDATE >= '2016-10-01'
and o.o_orderDATE < '2017-01-01'
group by cust_mkt_segment
order by total_order_amount;

+------------------+--------------------+
| cust_mkt_segment | total_order_amount |
+------------------+--------------------+
| AUTOMOBILE       |       352911555.24 |
| BUILDING         |       338742302.27 |
| FURNITURE        |       342992395.48 |
| HOUSEHOLD        |       339851076.28 |
| MACHINERY        |       353259581.80 |
+------------------+--------------------+

5 rows in set, 1 warning (0.19 sec)

This query will be executed via the following steps:

1. The customer table is identified, using table statistics, as the smaller / dimension table and is filtered first.
2. Scan the customer table across PM nodes on the c_nationkey, c_mktsegment, and c_custkey columns. Filter the c_nationkey column to rows matching the value 24 (US companies) and project c_custkey (the customer identifier) and c_mktsegment (the customer market segment).
3. Scan the larger / fact orders table across PM nodes on the o_custkey, o_orderDATE, and o_totalprice columns. Filter the o_orderDATE column to rows in the Q4 2016 range. If there are more than 8M rows in the orders table, the system utilizes the extent minimum and maximum values to eliminate reading extents that fall outside of the range.
4. A hash of c_custkey is built from the customer table results, and a distributed hash join is then executed against the orders table results in each PM server to produce the set of joined columns between customer and orders.
5. A distributed grouping and aggregation is then executed in each PM server to produce the group by results.
6. The results are accumulated in the UM server, which also combines any overlapping results coming from different PM servers.
7. Finally, the results are sorted by the total_order_amount column in the UM server to produce the final result.

Utility tools are provided within MariaDB ColumnStore to help understand the query optimization process. For further details, see the MariaDB ColumnStore knowledge base section on performance: https://mariadb.com/kb/en/mariadb/columnstore-performance-tuning/
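As one hedged example of such a tool (the calSetTrace / calGetTrace functions are described in the ColumnStore documentation, though the exact trace output depends on the version in use), query tracing can be switched on for a session and inspected after running the statement of interest:

-- Enable ColumnStore query tracing for this session, run the query
-- of interest, then retrieve the trace of how it was executed.
select calSetTrace(1);
select c.c_mktsegment, sum(o.o_totalprice)
from orders o join customer c on o.o_custkey = c.c_custkey
group by c.c_mktsegment;
select calGetTrace();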
