Zbigniew Baranowski is a database systems specialist and a member of a group which provides and supports central database and Hadoop-based services at CERN . This blog was originally released on CERN’s “Databases at CERN” blog, and is syndicated here with CERN’s permission.


This post presents a performance comparison of few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro , Apache Parquet , Apache HBase and Apache Kudu on the field of space efficiency, ingestion performance, analytic scans and random data lookup. This should help in understanding how (and when) each of them can improve handling of your big data workloads.


The initial idea for making a comparison of Hadoop file formats and storage engines was driven by a revision of one of the first systems that adopted Hadoop at large scale at CERN the ATLAS EventIndex.

This project was started in 2012, at a time when processing CSV with MapReduce was a common way of dealing with big data. At the same time platforms like Apache Spark, Apache Impala (incubating), or file formats like Avro and Parquet were not as mature and popular like nowadays or were even not started. Therefore in retrospect the chosen design based on using HDFS MapFiles has a notion of being ‘old’ and less popular.

The ultimate goal of our tests with ATLAS EventIndex data was to understand which approach for storing the data would be optimal to apply and what are expected benefits of such application with the respect to main use case of the system. The main aspects we wanted to compare were data volume and performance of

data ingestion, random data lookup full data scanning


ATLAS is one of seven particle detector experiments constructed for the Large Hadron Collider , a particle accelerator at CERN.

ATLAS EventIndex is a metadata catalogue of all collisions (called ‘events’) that happened in the ATLAS experiment and later were accepted to be permanently stored within CERN storage infrastructure (typically it is few hundreds of events per second). Physicists use this system to identify and locate events of interest, group events populations by commonalities and check a production cycle consistency.

Each indexed collision is stored in the ATLAS EventIndex as a separate record that in average is 1.5KB long and has 56 attributes, where 6 of them uniquely identifies a collision. Most of the attributes are text type, only a few of them are numeric. At the given moment there are 6e10 of records stored in HDFS that occupies tens of Terabytes (no including data replication).


The same data sets have been stored on the same Hadoop cluster using different storage techniques and compression algorithms (Snappy, GZip or BZip2):

Apache Avro is a data serialization standard for compact binary format widely used for storing persistent data on HDFS as well as for communication protocols. One of the advantages of using Avro is lightweight and fast data serialisation and deserialization, which can deliver very good ingestion performance. Additionally, even though it does not have any internal index (like in the case of MapFiles), HDFS directory-based partitioning technique can be applied to quickly navigate to the collections of interest when fast random data access is needed.

In the test, a tuple of the first 3 columns of a primary key was used as a partitioning key. This allowed obtaining good balance between the number of partitions (few thousands) and an average partitions size (hundreds of megabytes)

Apache Parquet is column oriented data serialization standard for efficient data analytics. Additional optimizations include encodings (RLE, Dictionary, Bit packing) and compression applied on series of values from the same columns give very good compaction ratios. When storing data on HDFS in Parquet format, the same partitioning strategy was used as in Avro case. Apache HBase scalable and distributed NoSQL database on HDFS for storing key-value pairs. Keys are indexed which typically provides very quick access to the records.

When storing ATLAS EventIndex data into HBase each event attribute was stored in a separate cell and row key was composed as a concatenation of an event identification attributes columns. Additionally, differential (FAST_DIFF) encoding of a row key (DATA_BLOCK_ENCODING) was enabled in order to reduce a size of HBase blocks (without this each row would have the length of 8KB).

Apache Kudu is new scalable and distributed table-based storage. Kudu provides indexing and columnar data organization to achieve a good compromise between ingestion speed and analytics performance. Like in HBase case, Kudu APIs allows modifying the data already stored in the system.

In the evaluation, all literal types were stored with a dictionary encoding and numeric types with bit shuffle encoding. Additionally, a combination of range and hash partitioning introduced by using the first column of the primary key (composed of the same columns like in the HBase case) as a partitioning key.


The data access and ingestion tests were on a cluster composed of 14 physical machines, each equipped with:

2 x 8 cores @2.60GHz 64GB of RAM 2 x 24 SAS drives

Hadoop cluster was installed from Cloudera Data Hub(CDH) distribution version 5.7.0, this includes:

Hadoop core 2.6.0 Impala 2.5.0 Hive 1.1.0 HBase 1.2.0 (configured JVM heap size for region servers = 30GB) (not from CDH) Kudu 1.0 (configured memory limit = 30GB)

Apache Impala (incubating) was used as a data ingestion and data access framework in all the conducted tests presented later in this report.

Important:Despite the effort made to obtain as much precise results as possible, they should not be treated as universal and fundamental benchmark of the tested technologies. There are too many variables that could influence the tests and make them more case specific, like:

chosen test cases data model used hardware specification and configuration software stack used for data processing and its configuration/tuning


Performance comparison of different file formats and storage engines in the Apac ...
Description of the test: Measuring the average record size after storing the same data sets (millions of records) using different techniques and compressions


主题: HBaseHadoopHDFSMapReduceHiveSQLSparkJVMTIUT
本文标题:Performance comparison of different file formats and storage engines in the Apac ...

技术大类 技术大类 | 数据库(综合) | 评论(0) | 阅读(8)