未加星标

Data Eng Weekly Issue #294

字体大小 | |
[数据库(综合) 所属分类 数据库(综合) | 发布者 店小二03 | 时间 2019 | 作者 红领巾 ] 0人收藏点击收藏

23 December 2018

It's a busy issue this week as lots of folks published articles before the year end. A few of the must reads include how the Guardian migrated from MongoDB to Postgres, an analysis of UDFs/builtins/map functions in Apache Spark, Facebook's work on HyperLogLog for Presto, and Hortonworks' post on a Container Storage Interface for Apache Hadoop YARN. There are also a lot of interesting tools to checkout―KubeDB for running databases on Kubernetes, the discrETLy project that provides a business dashboard for Apache Airflow, and the Redix key-value store.

Technical

The Guardian writes about how they migrated from MongoDB to Postgres (storing data in JSONB). The post offers a glimpse at their rollout and operations of migrating data, including how they replicated API calls to execute them on both databases and how they used structured logging. Lots of useful details if you're ever considering a similar database swap.

https://www.theguardian.com/info/2018/nov/30/bye-bye-mongo-hello-postgres

The Facebook blog has a post about the HyperLogLog (HLL) implementation in Presto that is used for computing approximate distinct counts. The post has a good introduction to the math behind HLL and why it can achieve big speedups.

https://code.fb.com/data-infrastructure/hyperloglog/

This post covers the various settings that can be used to tune the performance of an Apache Kafka cluster and its producers/consumers.

https://medium.com/@rinu.gour123/kafka-performance-tuning-ways-for-kafka-optimization-fdee5b19505b

Apache Kafka 2.1 was released just over a month ago. This post has a good overview of all the new features and improvements in the release (there are quite a few!).

https://medium.com/@stephane.maarek/new-features-of-kafka-2-1-33fb5396b546

Hortonworks has a post comparing the performance of Hive in HDP 2.6.5 with that in 3.0. The version in 3.0 adds ACID semantics but also performs nearly 2x faster (in their TCP-DS analyses) thanks to dynamic runtime filtering and vectorization.

https://hortonworks.com/blog/2x-faster-bi-interactive-queries-with-hdp-3-0/

The Container Storage Interface (CSI) is an API standard for building pluggable storage volumes for distributed systems like YARN and Kubernetes. This post shows how to use HDFS, via the Apache Hadoop Ozone Object Store implementation and the csi-s3 project, as a CSI driver in YARN (or potentially other systems).

https://hortonworks.com/blog/open-hybrid-architecture-running-stateful-containers-on-yarn/

For batch / analytical workloads, there's a Hive to Kafka integration in the latest release of HDP. It's architected to integrate with Kafka's timestamps and offsets for time travel and other optimizations. The Hortonworks blog has a bunch more about the implementation.

https://hortonworks.com/blog/introducing-hive-kafka-sql/

The Streamlio blog describes how Apache Pulsar's architecture, which separates the storage and serving layers, contrasts to other streaming systems like Apache Kafka. Specific advantages include isolation of catch up reads from real-time tailing, and infinite storage by offloading cold data to a blob store like S3.

https://streaml.io/blog/apache-pulsar-architecture-designing-for-streaming-performance-and-scalability

Apache Hive 3.0 includes improved support for federating queries to JDBC sources, such as the ability to apply filters and projections in the source system. Apache Calcite is used for the query planning and can intelligently use source database-specific SQL features.

https://hortonworks.com/blog/query-federation-with-hive/

Facebook writes about their usage of zstandard, the efficient compression library, for use cases like ORC files, databases, and storing data in RocksDB.

https://code.fb.com/core-data/zstandard/

Cloudera and Intel collaborated on an implementation of vectorized reading of Apache Parquet data in Apache Hive. This post describes vectorization, the implementation, and the performance improvements that this work has realized (about 26% on average).

https://blog.cloudera.com/blog/2018/12/faster-swarms-of-data-accelerating-hive-queries-with-parquet-vectorization/

This article explores the various ways to implement custom logic in Apache Spark, such as UDFs, Map functions, and builtins. It analyzes the ease of implementation as well as the performance impact using some microbenchmarking. There's also a discussion of how each option impacts the query plan―for example can predicate pushdown still be used?

https://medium.com/@F4Qtech/udfs-vs-map-vs-custom-spark-native-functions-91ab2c154b44

Lyfts Airflow deployment covers 500 DAGs and executes on nearly 20 executor nodes. They share some tips and lessons learned from running Airflow at this scale, including how they use a "canary" health check to measure end-to-end availability, how they've improved DAG execution performance, and their Airflow customizations.

https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff

A good collection of tips for getting the most performance out of Apache Spark.

https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11

An interesting analysis of how high availability in a stream processing engine tends to add a lot of operational complexity, especially if you have to bring ZooKeeper and YARN into the mix. The author suggests an alternative version of HA―running the stream pipeline twice―to avoid bloat for a smaller use case.

https://medium.com/stream-processing/is-your-stream-processor-obese-d33936de00ff

Jobs Lead Distributed Data Engineer, Nomics, Remote https://jobs.dataengweekly.com/jobs/ffddf793-9772-4ff2-9c1a-aca3f02064c5

The end of the year finds lots of folks looking for a change! Post a job to the Data Eng Weekly job board for $99. https://jobs.dataengweekly.com/

News

DataStax Accelerate is taking place this May in National Harbor, Maryland. The Call For Papers is open through February 15th.

https://www.datastax.com/2018/12/share-your-knowledge-at-the-premier-apache-cassandra-conference

Datanami has compiled a collection of big data predictions for 2019, including the adoption of ML by enterprises, changes to real-time data processing thanks to 5G networks, additional edge computing, and slowing of growth of Hadoop.

https://www.datanami.com/2018/12/18/whats-up-with-2019-big-data-predictions/

Confluent has written a follow up post about the new Confluent Community License to address some frequent questions.

https://www.confluent.io/blog/developers-guide-confluent-community-license

Releases

KubeDB is a set of Kubernetes Custom Resource Definitions, CLI tools, and more for running databases on Kubernetes. It supports Postgres, Elasticsearch, mysql, MongoDB, and several other systems. Version 0.9.0 was released this week.

https://github.com/kubedb/cli/releases/tag/0.9.0

discreETLy is a new open source project from Fandom. It's a dashboard that presents a human-friendly (e.g. for business owners) dashboard of your Apache Airflow workflows by consumes the Airflow API. It's stateless and can be run as a Docker container.

https://medium.com/fandom-engineering/how-discreetly-helped-us-improve-trust-and-communication-between-tech-and-business-5e411d26a06f https://github.com/Wikia/discreETLy

Version 3.0 of wolkenkit, the CQRS and event sourcing framework for Node.js, has been released. Major new features include a rewritten file storage, authorization and security features, and additional CLI commands.

https://www.thenativeweb.io/blog/2018-12-04-18-17-introducing-wolkenkit-3-0/

Hadoop Submarine is a new project for building deep learning frameworks, like TensorFlow, on Apache Hadoop. It has integrations with Zeppelin and Azkaban for running jobs.

https://hortonworks.com/blog/submarine-running-deep-learning-workloads-apache-hadoop/

Cloudera Enterprise 6.1.0 is out, with updates across the entire suite of products. Their post has a summary of major features.

https://blog.cloudera.com/blog/2018/12/cloudera-enterprise-6-1-0-is-now-available/

Apache Flink 1.7.1 was released with a couple dozen bug fixes.

https://flink.apache.org/news/2018/12/21/release-1.7.1.html

Redix is a key-value store that implements the Redis protocol, persistence, and additional functionality like ACID transactions. Version 1.5 was just released.

https://github.com/alash3al/redix/releases/tag/v1.5

Events

Curated by Datadog ( http://www.datadog.com )

ISRAEL

#ApacheKafkaTLV Hosting Gwen Shapira (Tel Aviv-Yafo) - Tuesday, December 25

https://www.meetup.com/ApacheKafkaTLV/events/256775780/

Fullstack 2018 Hackathon: Boost Your MSA & Data Pipelines with Kafka Streams (Tel Aviv-Yafo) - Tuesday, December 25

https://www.meetup.com/full-stack-developer-il/events/256339545/ INDIA

Building Microservices with Reactive Architecture (Pune) - Saturday, December 29

https://www.meetup.com/ReactivePune/events/257025378/

本文数据库(综合)相关术语:系统安全软件

代码区博客精选文章
分页:12
转载请注明
本文标题:Data Eng Weekly Issue #294
本站链接:https://www.codesec.net/view/628158.html


1.凡CodeSecTeam转载的文章,均出自其它媒体或其他官网介绍,目的在于传递更多的信息,并不代表本站赞同其观点和其真实性负责;
2.转载的文章仅代表原创作者观点,与本站无关。其原创性以及文中陈述文字和内容未经本站证实,本站对该文以及其中全部或者部分内容、文字的真实性、完整性、及时性,不作出任何保证或承若;
3.如本站转载稿涉及版权等问题,请作者及时联系本站,我们会及时处理。
登录后可拥有收藏文章、关注作者等权限...
技术大类 技术大类 | 数据库(综合) | 评论(0) | 阅读(116)