
Rethinking Modern Data Warehouse with Azure Analysis Services


Before I became more familiar with Microsoft Azure and all its PaaS components, such as Azure Analysis Services, I routinely stuck to Microsoft’s on-premises BI stack. This constrained view of the technology landscape also brought along some restrictions when it came to re-interpreting familiar architectures in the cloud.

What actual scenario do I want to point out?

Once, we all were cloud virgins… imagine the following situation: you have your SQL Server data warehouse on-premises with regular relational tables, and you are using a Tabular Model on top of it. You are building fancy dashboards and hosting them on your Power BI Report Server, or you have them running in the Power BI service. Doesn’t sound new to you, right!?

Now you can do this sort of stuff in Azure. It’s the Modern Data Warehouse approach (for more details on MDWH check this out). “I know!” …most would say by now. However, what I have experienced in most customer contexts is that people only know the rigid, classical approach: a relational component, either Azure SQL Database (managed instance) or Azure SQL Data Warehouse, on top of their data lake, with an Azure Analysis Services instance hosting some Tabular Models attached on top of that.

This is what such a typical MDWH cloud architecture could look like:


[Figure: a typical Modern Data Warehouse cloud architecture]
Is there maybe something we don’t see?

There are many good reasons to use relational components such as SQL DW, especially for big workloads, considering scalability, reliability, and the skillsets readily available in the SQL context. Another reason is to have an infrastructure that helps you delegate tasks to dedicated teams in the context of ETL or data reliability and consistency. I really do not want to understate the importance of one accurate and trustworthy source of truth for BI solutions.

On the other hand, we may also face scenarios where we do not need to store data historically (SCD), or where our master data gets fully reloaded each day by the source systems. A big data context could also be legit, where we may not need a classical data modeling approach at all, or where we simply want to visualize a massive amount of data. For such scenarios, a relational component between the data lake and Azure Analysis Services may just cause overhead.

So why not lift our file-based data directly to Azure Analysis Services!?

You may ask yourself, how can this be done?! Is it even possible? In the following sections, I want to briefly outline how, indeed, this can be achieved.

How to feed Azure Analysis Services directly from Azure Blob Storage or Azure Data Lake Store

Assuming we have a nice set of files available in our storage, getting started is pretty straightforward. The good news is, we have all required connectors available in the current version of SQL Server Data Tools (SSDT) for Tabular projects.


[Figure: available data source connectors in SSDT for Tabular projects]

This has some straightforward implications for our possible architecture, which could simply look like the following:


[Figure: simplified architecture with Azure Analysis Services fed directly from Azure Blob Storage / Azure Data Lake Store]

However, you have to watch out for how files are recognized by the storage connectors. Individual files in a particular storage location are not exposed as separate tables; instead, the entire contents of a blob container or Data Lake Store folder are presented as one tabular structure, with one row per file:


[Figure: storage contents exposed as a single table of files]
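To make this a bit more concrete, here is a minimal sketch of such a source query in Power Query (M), as you would see it in the Advanced Editor. The storage account name mdwhstorage and the container name rawdata are purely illustrative placeholders:

    let
        // Connect to the storage account; the result is one row per container
        Source  = AzureStorage.Blobs("https://mdwhstorage.blob.core.windows.net"),
        // Navigate into the "rawdata" container: one row per blob,
        // with Name, Extension, Date modified and the binary Content column
        rawdata = Source{[Name = "rawdata"]}[Data]
    in
        rawdata

Every file, regardless of which logical table it belongs to, shows up as just another row in this list, which is exactly why the modeling work described below is necessary.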

Now, if you want to create Tabular Models out of your files, you have multiple approaches to do so: either you accomplish the task programmatically by using the Tabular Object Model (TOM), or you work with SSDT, in particular the Power Query Editor and the Advanced Editor.

In this blog post, I want to sketch the approach via SSDT and the Power Query Editor. However, please note that you should have the May 2017 release of SSDT (or newer) in order to perform the steps briefly presented in this blog post. Compatibility level 1400 is of major importance!

In general, if we talk about large-scale Tabular Models, we can think of an Azure-hosted model with up to 400 GB of cache (memory). Thanks to the VertiPaq engine, we can achieve compression rates of up to roughly 10x, so we could think of loading terabytes of file data into our model (roughly 400 GB × 10 ≈ 4 TB of source data in the best case).

There are three major aspects which you need to take into consideration when working with this file-based approach in Azure Analysis Services:

1. Work with subsets of the data

If we consider working with terabytes of data spread across sets of bigger and smaller files, we will in most cases not have the resources to load all that data into an instance of Azure Analysis Services, and SSDT would otherwise also try to load all of it into your workspace server. There may be downsides on the storage level as well as on the processing side. Therefore, it is advisable to work with a representative subset of your data and, later on during deployment, switch the source of your data to the original files.
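One simple way to do this in M is sketched below: cap the number of files pulled in while developing, then raise the cap (or drop the FirstN step) when switching to the full data before deployment. The account, container, path convention and the SampleFileCount constant are all illustrative assumptions:

    let
        // Small during development; bump up (or remove the FirstN step) before the final run
        SampleFileCount = 5,
        Source     = AzureStorage.Blobs("https://mdwhstorage.blob.core.windows.net"),
        rawdata    = Source{[Name = "rawdata"]}[Data],
        // Restrict to the files belonging to this table (path convention is an assumption)
        SalesFiles = Table.SelectRows(rawdata, each Text.StartsWith([Name], "sales/")),
        // During development, only take a handful of files so the workspace server survives
        Sample     = Table.FirstN(SalesFiles, SampleFileCount)
    in
        Sample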

2. Keep data management and performance requirements in mind

A possible big data scenario would include loading a vast number of files of all different size ranges in a Tabular Model 1400. In this context we face a few challenges:

- Handle metadata correctly (table names, headers, etc.), as the source files may be lacking in terms of metadata
- Combine files so that all the data needed ends up together
- Find a way to process these amounts of data efficiently

How can we overcome these challenges?

The first step is to create a source query for our desired starting table by using the Query Builder in SSDT. In the next step, we can combine the remaining files for the table. During these steps, the Query Builder automatically creates some expressions. However, you need to be careful here, because with every new table (based on files) you add, the Query Builder will add more and more expressions. So, along the way to your final Tabular Model, you will definitely need to do some cleaning to avoid clutter. Try to rely on global functions that do the work for you; that way, the handling of the necessary metadata, such as object names and headers, is covered as well. Technically this is all handled in M functions, but for the purpose of giving you a rough idea I am not digging into all the details.
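Just to give a rough idea of what such a global function might look like: the sketch below defines a reusable function (ReadCsvBlob is a made-up name) that turns one blob’s binary content into a table with promoted headers, and then combines all selected files into a single table. The delimiter, encoding and path filter are assumptions about the source files:

    let
        // Reusable helper: parse one CSV blob and promote the first row to column headers
        ReadCsvBlob = (blobContent as binary) as table =>
            let
                Csv      = Csv.Document(blobContent, [Delimiter = ",", Encoding = 65001]),
                Promoted = Table.PromoteHeaders(Csv, [PromoteAllScalars = true])
            in
                Promoted,
        Source     = AzureStorage.Blobs("https://mdwhstorage.blob.core.windows.net"),
        rawdata    = Source{[Name = "rawdata"]}[Data],
        SalesFiles = Table.SelectRows(rawdata, each Text.StartsWith([Name], "sales/")),
        // Apply the helper to every file and stack the results into one table
        Combined   = Table.Combine(List.Transform(SalesFiles[Content], ReadCsvBlob))
    in
        Combined

Keeping the parsing logic in one place like this avoids the clutter of per-file expressions that the Query Builder would otherwise accumulate.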

Let’s assume we have loaded our (small) required tables into SSDT and finished designing the model. Now it is time to deploy it to Azure!

3. Parallelized incremental loads on the deployed Tabular Model

Once the model is deployed, we are switching to SQL Server Management Studio. SSMS allows us to directly connect to our Tabular Model. It enables us to perform several operations against our deployed model, such as incremental data loads for efficient data processing.

For incremental data loads, we need to create partitions in the tables within AAS. Partitions in the AAS context will not improve query performance, but they definitely improve processing time, as they enable parallelism during processing. The MS Analysis Services product team performed a pretty impressive test in this context: they managed to load 1 TB of TPC-DS data (a decision support benchmark) in less than 12 hours by using partitions on an S9 AAS instance (28 partitions processed in parallel). In contrast, a full load with a single partition (i.e. no partitioning) takes more than 21 hours.
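To give an impression of how such partitions relate to the files, here is a sketch of what the M source query of a single partition could look like, assuming (purely for illustration) that the blobs are organized in year/month folders. Each partition then only touches its own slice of files and can be processed in parallel with, or independently of, the others:

    let
        Source   = AzureStorage.Blobs("https://mdwhstorage.blob.core.windows.net"),
        rawdata  = Source{[Name = "rawdata"]}[Data],
        // This partition only covers the files for January 2018;
        // a sibling partition would do the same for sales/2018/02/, and so on
        JanFiles = Table.SelectRows(rawdata, each Text.StartsWith([Name], "sales/2018/01/")),
        // Parse each blob and stack the results into the partition's data
        Parsed   = List.Transform(JanFiles[Content],
                       each Table.PromoteHeaders(
                                Csv.Document(_, [Delimiter = ",", Encoding = 65001]),
                                [PromoteAllScalars = true])),
        Combined = Table.Combine(Parsed)
    in
        Combined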

What are the factors that determine the processing speed?

As the Analysis Se
