Overview of Griffin

At eBay, measuring data quality is a significant challenge whenever people work with big data, whether in Hadoop or in streaming systems. Different teams have built customized tools to detect and analyze data quality issues within their own domains. As a platform organization, we prefer to take a platform approach to commonly occurring patterns. We are therefore building a platform that provides shared infrastructure and generic features to solve common data quality pain points. This will enable us to build trusted data assets.

Currently it is very difficult and costly to validate data quality when we have large volumes of related data flowing across multiple platforms (streaming and batch). Take eBay’s Bullseye Personalization Platform as an example: every day we have to validate the data quality for ~600M records. Data quality often becomes an enormous challenge in this complex environment and at this massive scale.

Our investigation found the following gaps at eBay:

No end-to-end unified view of data quality from multiple data sources to target applications that takes account of data lineage. This results in a long delay in identifying and fixing data quality issues.

No system to measure data quality in streaming mode through self-service. The need is for a system with a simple tool for registering data assets, defining data quality models, visualizing and monitoring data quality, and alerting teams when an issue is detected.

No shared platform and API service. Each team should not have to apply and manage its own hardware and software infrastructure to solve this common problem.

With these needs in mind, we decided to build Griffin, a data quality service that aims to solve these shortcomings. Griffin is an open-source solution for validating the quality of data in an environment with distributed data systems, such as Hadoop, Spark, and Storm. It creates a unified process to define, measure, and report quality for the data assets in these systems. You can see Griffin’s source code at its home page on GitHub.

Features

Accuracy measurement: Assessment of the accuracy of a data asset compared to a verifiable source

Data profiling: Statistical analysis and assessment of data values within a data asset for consistency, uniqueness, and logic

Anomaly detection: Pre-built algorithmic functions for the identification of events that do not conform to an expected pattern in a data asset

Visualization: Dashboards that can report the state of data quality

Key benefits

Real-time: The data quality checks can be executed in real time to detect issues faster.

Extensible: The solution can work with multiple data systems.

Scalable: The solution is designed to work on large volumes of data. It currently runs on ~1.2 PB of data.

Self-serviceable: The solution provides a simple user interface to define new data assets and rules. It also allows users to visualize the data quality dashboards and personalize their view of the dashboards.

System process

Griffin has been deployed at eBay and is serving major data systems. It takes a platform approach to providing generic features to solve common data quality validation pain points. To detect data quality issues, the key process is as follows.

The user registers the data asset.

The Model Engine creates a data quality model for the data asset.

The Model Engine calculates metrics.

Any data quality issue is reported through email or the web portal.

The following BPMN (Business Process Model and Notation) diagram illustrates the system process.

[Figure: BPMN diagram of the Griffin system process, captioned "Griffin―Model-driven Data Quality Service on Cloud for Both Real-time and Batch ..."]

The following sections describe each step in detail.

Registering the data asset

The user can register the data set to be used for a data quality check. The data set can be batch data in an RDBMS (for example, Teradata) or a Hadoop system, or near real-time streaming data from Kafka, Storm, and other real-time data platforms. Normally, some basic information should be provided for the data asset, including name, type, schema definition, owner, and other items.
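As a rough illustration of the information a registration carries, here is a minimal Python sketch. The `DataAsset` and `AssetRegistry` names, the field names, and the example values are all hypothetical; they are not Griffin's actual API, only one way the listed attributes (name, type, schema definition, owner) could be captured.

```python
from dataclasses import dataclass
from typing import Dict


# Hypothetical record type: field names are assumptions, not Griffin's schema.
@dataclass
class DataAsset:
    name: str
    asset_type: str         # e.g. "hive-table" or "kafka-topic" (illustrative values)
    schema: Dict[str, str]  # column name -> column type
    owner: str


class AssetRegistry:
    """In-memory stand-in for an asset registration service (hypothetical)."""

    def __init__(self) -> None:
        self._assets: Dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> DataAsset:
        # Reject unnamed assets and duplicate registrations.
        if not asset.name:
            raise ValueError("asset name is required")
        if asset.name in self._assets:
            raise ValueError(f"asset {asset.name!r} is already registered")
        self._assets[asset.name] = asset
        return asset

    def get(self, name: str) -> DataAsset:
        return self._assets[name]


registry = AssetRegistry()
asset = DataAsset(
    name="user_events",
    asset_type="kafka-topic",
    schema={"user_id": "string", "ts": "long"},
    owner="dq-team@example.com",
)
registry.register(asset)
```

Registering the same name twice raises a `ValueError` in this sketch; a real service would need its own policy for updates and versioning.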

Creating the model

After the data asset is ready, the user can create a data quality model to define the data quality rules and metadata. We can define models for different data quality dimensions, such as accuracy, data profiling, anomaly detection, validity, timeliness, and so on.
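To make the idea of a model definition concrete, the following Python sketch binds a registered asset to one data quality dimension, a rule, and a threshold. Everything here is an assumption made for illustration: the `QualityModel` class, its fields, the free-form rule string, and the convention that metric values and thresholds lie between 0 and 1. Only the dimension names come from the text above.

```python
from dataclasses import dataclass

# Dimension names follow the text; the identifiers themselves are illustrative.
DIMENSIONS = {"accuracy", "profiling", "anomaly_detection", "validity", "timeliness"}


@dataclass
class QualityModel:
    name: str
    asset_name: str   # name of the registered data asset this model checks
    dimension: str    # one of DIMENSIONS
    rule: str         # free-form rule description (hypothetical format)
    threshold: float  # minimum acceptable metric value, assumed to be in [0, 1]

    def __post_init__(self) -> None:
        # Validate the definition at creation time.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0 and 1")


model = QualityModel(
    name="user_events_accuracy",
    asset_name="user_events",
    dimension="accuracy",
    rule="user_id and ts match the source table",
    threshold=0.95,
)
```

Validating at construction time means a malformed model is rejected before it is ever scheduled for execution.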

Executing the model

The model or rule is executed automatically (by the Model Engine) to get the sample data quality validation results in a few seconds for streaming data. “Data quality model design” introduces the details of how the Model Engine is designed and executed.

Calculating on Spark

The models run on Spark, where they calculate data quality values for both real-time and batch data. This allows large-scale data to be handled in a timely fashion.

Generating the metrics value

After the data quality values are calculated, the metrics value is generated based on the calculation results and persisted in the MongoDB database.

Notifying by email

If any metrics value falls below its threshold, an email notification is triggered, so the end user learns of a data quality issue as soon as it occurs.
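The threshold check behind such a notification can be sketched in a few lines of Python. The function name `find_breaches`, the metric names, and the message format are hypothetical, and actual email delivery is left out; the sketch only shows how metric values might be compared against per-metric thresholds.

```python
from typing import Dict, List


def find_breaches(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one alert message per metric whose value is below its threshold."""
    alerts = []
    for name, value in sorted(metrics.items()):
        threshold = thresholds.get(name)
        # Metrics without a configured threshold are not checked.
        if threshold is not None and value < threshold:
            alerts.append(
                f"{name}: value {value:.3f} is below threshold {threshold:.3f}"
            )
    return alerts


alerts = find_breaches(
    metrics={"user_events_accuracy": 0.91, "orders_accuracy": 0.99},
    thresholds={"user_events_accuracy": 0.95, "orders_accuracy": 0.95},
)
for message in alerts:
    print(message)  # each message would become the body of a notification
```

In this example only `user_events_accuracy` is reported, because 0.91 is below its threshold of 0.95 while `orders_accuracy` is above its own.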

Web portal and metrics display

Finally, all metrics values are displayed in the web portal, so that the user can analyze the data quality results through Griffin’s built-in visualization tool and then take action.

System architecture

To accomplish this process, we designed three layers for the entire system, as shown in the following architecture design diagram:

Data collection and processing layer

Back-end service layer

User interface

[Figure: Griffin architecture design diagram]

Data collection and processing layer

The key component of this layer is our Model Engine. Griffin is a model-driven solution, and the user can choose various data quality dimensions to execute data quality validation based on a selected target data set or source data set (as the golden reference data). A corresponding library in the back end supports the measurements.
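As an illustration of validating a target data set against a golden source, the Python sketch below computes a simple accuracy ratio. It assumes one common definition, the fraction of target records whose compared fields equal those of the source record with the same key. That definition, the function name, and the sample records are assumptions for this example and are not taken from Griffin's implementation, which runs on Spark rather than on in-memory lists.

```python
from typing import Dict, Iterable, Sequence

Record = Dict[str, object]


def accuracy(
    source: Iterable[Record],
    target: Iterable[Record],
    key: str,
    fields: Sequence[str],
) -> float:
    """Fraction of target records that match the source record with the same key."""
    source_by_key = {rec[key]: rec for rec in source}
    total = 0
    matched = 0
    for rec in target:
        total += 1
        ref = source_by_key.get(rec[key])
        # A record counts as accurate only if it exists in the source
        # and every compared field has the same value.
        if ref is not None and all(rec.get(f) == ref.get(f) for f in fields):
            matched += 1
    # Convention chosen for this sketch: an empty target counts as fully accurate.
    return matched / total if total else 1.0


source = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
target = [
    {"id": 1, "amount": 10},  # matches the source
    {"id": 2, "amount": 25},  # value differs from the source
    {"id": 3, "amount": 30},  # matches the source
    {"id": 4, "amount": 40},  # missing from the source
]
score = accuracy(source, target, key="id", fields=["amount"])
print(score)
```

Here two of the four target records match the golden source, so the score is 0.5.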

We support two kinds of data sources: batch data and real-time data. For batch mode, we can collect the data source from our Hadoop platform through various data connectors. For real-time mode, we can connect with messaging systems like Kafka to achieve near real-time analysis. After retrieving the data, the Model Engine computes data quality metrics in our Spark cluster.

Back-end service layer

On the back-end service layer, we have three key components.

The Core Service is responsible for metadata management, such as model definition, subscription management, user customization, and so on.

The Job Scheduler is responsible for scheduling the jobs, interacting with Model Engine, saving metrics values, sending email notifications, etc.

RESTful web services accomplish all the functions of Griffin, such as registering data sets, creating data quality models, publishing metrics, retrieving metrics, and adding subscriptions. Developers can develop their own user interfaces using these web services.

User interface

We have a built-in visualization tool for Griffin. It’s a web front-end application that leverages

