Hadoop Weekly Issue #204

Category: Databases (General) | Posted by: 店小二05 | Date: 2017 | Author: 红领巾

12 February 2017

This week's issue is notable because it covers a number of projects at the periphery of the Hadoop ecosystem (Pachyderm, MongoDB, RethinkDB, and Hazelcast). If you're just here for the Hadoop ecosystem news, don't worry: there is great coverage of Spark on YARN, PySpark and Apache Arrow, and Kafka Streams.

Technical

Pachyderm, a data lake that supports version control for data, includes a data processing API. This article and the referenced sample code demonstrate how to use the Python API to join two datasets, and the article describes how Pachyderm takes care of sharding and distributing the data as necessary.

https://medium.com/pachyderm-data/easy-distributed-joins-with-pachyderm-8307bab8a761
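
To make the sharding idea concrete, here is a minimal sketch in plain Python of a hash-partitioned join. It does not use Pachyderm's API and is not taken from the linked article; the function names, the shard count, and the record layout are all illustrative assumptions. Both datasets are split by a hash of the join key, so matching rows land in the same shard and each shard can be joined independently (and, in a real system, on a different worker).

```python
import zlib
from collections import defaultdict


def shard_of(key, num_shards):
    """Map a join key to a shard number; crc32 is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def partition(rows, key_field, num_shards):
    """Split rows into num_shards lists according to the join key."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_of(row[key_field], num_shards)].append(row)
    return shards


def join_shard(left_rows, right_rows, key_field):
    """Inner hash join of two shards that share the same key range."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[key_field]].append(right)
    joined = []
    for left in left_rows:
        for right in index.get(left[key_field], []):
            joined.append({**left, **right})
    return joined


def sharded_join(left, right, key_field, num_shards=4):
    """Partition both inputs the same way, then join shard by shard.

    The loop below runs sequentially; each iteration is independent,
    which is what would allow the work to be distributed.
    """
    left_shards = partition(left, key_field, num_shards)
    right_shards = partition(right, key_field, num_shards)
    result = []
    for left_rows, right_rows in zip(left_shards, right_shards):
        result.extend(join_shard(left_rows, right_rows, key_field))
    return result
```

Calling `sharded_join(users, purchases, "user_id")` on two lists of dictionaries returns one merged dictionary per matching pair; rows whose key appears on only one side are dropped, as in any inner join.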

While configuring and monitoring a highly-available YARN cluster isn't the easiest thing, it does offer some clear advantages as a platform for running Spark. These include support for multiple versions of Spark, resource management, and resilience. The Altiscale blog elaborates on these and other benefits.

https://www.altiscale.com/blog/why-spark-on-yarn-and-not-standalone/

This post shares an analysis of MongoDB's replication and durability guarantees in the face of Jepsen testing (which introduces network partitions and other failure scenarios). MongoDB's previous replication system has inherent flaws, but a new replication system (based on Raft) fixes the fatal flaws. In fact, no major bugs related to lost updates, dirty reads, or stale reads were found in version 3.4.0 of MongoDB.

https://jepsen.io/analyses/mongodb-3-4-0-rc3

This presentation describes how Apache Arrow speeds up the bridge between Python and the JVM for PySpark. The talk starts by describing how PySpark interacts with the Spark execution engine, and it then covers some of the improvements that already exist and some speedups that are coming down the pike.

http://www.slideshare.net/wesm/improving-python-and-spark-pyspark-performance-and-interoperability

This post demonstrates using Pivotal Cloud Foundry to launch a PySpark application that trains a linear model and a Python Flask application that serves predictions based on the trained model coefficients.

https://content.pivotal.io/blog/operationalizing-pyspark-data-science-models-on-pivotal-cloud-foundry

Cloudera has a post demonstrating analysis of flight data with sparklyr, the Spark-based backend for dplyr.

http://blog.cloudera.com/blog/2017/02/analyzing-us-flight-data-on-amazon-s3-with-sparklyr-and-apache-spark-2-0/

AWS has published scripts for importing Hive table definitions into Amazon Athena (the Presto-based, hosted big data query engine), and a blog post that describes how to use them.

https://aws.amazon.com/blogs/big-data/migrate-external-table-definitions-from-a-hive-metastore-to-amazon-athena/

The Databricks blog has an example of using the Intel BigDL project for deep learning with Apache Spark. The post describes how to get started, including training the model and evaluating the predictions it makes.

https://databricks.com/blog/2017/02/09/intels-bigdl-databricks.html

This tutorial builds on a Kafka Streams application that consumes demographic data about countries to create a streaming calculation of the top-3 countries by population within each continent.

https://technology.amis.nl/2017/02/12/apache-kafka-streams-running-top-n-grouped-by-dimension-from-and-to-kafka-topic/
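
The shape of that calculation can be sketched in plain Python. This is not the Kafka Streams API and not the tutorial's code; the record fields (`continent`, `name`, `population`) and the function name are assumptions made for illustration. The sketch keeps the latest population per country, grouped by continent, and emits the current top 3 for a continent every time a record for it arrives.

```python
from collections import defaultdict


def running_top_n(records, n=3):
    """Yield (continent, top-n country names) after each incoming record.

    State is the latest population seen per country, grouped by continent.
    A later record for the same country overwrites the earlier value.
    """
    populations = defaultdict(dict)  # continent -> {country: population}
    for record in records:
        by_country = populations[record["continent"]]
        by_country[record["name"]] = record["population"]
        # Largest population first; ties are broken by country name.
        ranked = sorted(by_country.items(), key=lambda kv: (-kv[1], kv[0]))
        yield record["continent"], [name for name, _ in ranked[:n]]
```

Because the function is a generator, it can consume an unbounded iterator of records and produce one update per input, which mirrors how a streaming job writes a new result to an output topic whenever its input changes.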

News

The first two chapters of "Data Science on the Google Platform" are available as part of the O'Reilly early release.

http://shop.oreilly.com/product/0636920057628.do

The Cloud Native Computing Foundation (CNCF) has purchased the rights to RethinkDB, and they have re-licensed it under the Apache License. This adds a strong distributed database system to the CNCF portfolio of hosted projects which includes Fluentd and Kubernetes.

https://www.joyent.com/blog/the-liberation-of-rethinkdb

Apache Ranger, the security system for the Hadoop ecosystem, has graduated to be a top-level project. The announcement has a good overview of the features Ranger provides as well as some companies that are using it.

https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces3

Hadoop as a Service vendor Qubole has announced that they're now SOC 2 Type II compliant.

http://www.qubole.com/blog/qubole-successfully-completes-soc-2-type-ii-examination/

Spark Summit East was last week in Boston. This post has summaries of and links to videos/slides from a number of sessions at the Summit.

https://databricks.com/blog/2017/02/09/spark-summit-east-2017-another-record-setting-spark-summit.html

Releases

The Hortonworks blog has a recap of the features in the recently released Apache Zeppelin 0.7.0. Key features include improvements to multi-user support, new pluggable visualization, and Spark improvements (adding support for Spark 2.1).

http://hortonworks.com/blog/welcome-apache-zeppelin-0-7-0

Apache Flink 1.2.0 was released. It's a giant release that resolves over 650 issues but maintains backwards compatibility with all public APIs. This post has an overview of the key features, including dynamic scaling of streaming jobs (by restoring from a savepoint), support for running Flink with Apache Mesos, experimental support for encryption in transit, and major improvements to the Table API.

http://flink.apache.org/news/2017/02/06/release-1.2.0.html

In addition to their compliance news, Qubole has announced the general availability of the Qubole Data Service on the Oracle Bare Metal Cloud Service.

http://www.qubole.com/blog/product/now-generally-available-qds-on-oracle-bare-metal-cloud-service/

MapR has announced the MapR Converged Data Platform for Docker, which provides a mechanism for running Docker containers atop MapR. Using the MapR file system and MapR Streams, microservices can be relocated to another server in the cluster without losing any state.

https://www.mapr.com/blog/persistence-age-microservices-introducing-mapr-converged-data-platform-docker

Apache Beam 0.5.0 was released this week. This release adds new APIs for State and Timers. According to the JIRA release notes, over 25 bugs were resolved and 25+ improvements and new features are part of the new version.

http:[email protected]/msg03687.html

Hazelcast, makers of the in-memory data grid of the same name, have announced a new open-source distributed processing system called Hazelcast Jet. Jet uses cooperative multithreading to take advantage of multi-core CPUs, implements distributed support for java.util.stream, and more. The code is available on GitHub.

https://dzone.com/articles/introducing-hazelcast-jet

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

#SDBigData Meetup #20 (San Diego) - Wednesday, February 15

https://www.meetup.com/sdbigdata/events/236400797/

55th Bay Area Hadoop User Group Meetup (Sunnyvale) - Wednesday, February 15

https://www.meetup.com/hadoop/events/237178448/

Big-Data-As-A-Service: Big Data Analytics on AWS (Santa Clara) - Wednesday, February 15

https://www.meetup.com/Big-Data-as-a-Service/events/237209923/

Stream Processing with Apache Kafka & Apache Samza (Sunnyvale) - Thursday, February 16

https://www.meetup.com/Stream-Processing-Meetup-LinkedIn/events/237171557/

Mining Member Feedback to Improve the Customer Experience: Nishant Hegde from Netflix (Culver City) - Thursday, February 16

https://www.meetup.com/Los-Angeles-Big-Data-Users-Group/events/236711651/

Washington

Data in Motion with Open Source Apache NiFi (Bellevue) - Wednesday, February 15

https://www.meetup.com/Big-Data-Bellevue-BDB/events/237387396/

Illinois

Keeping Spark on Track: Best Practices Using Apache Spark in Production (Chicago) - Monday, February 13

https://www.meetup.com/acm-chicago/events/237315991/

Virginia

Hybrid Transactional/Analytical Processing Using Spark & In-Memory Data Fabrics (Tysons) - Thursday, February 16

https://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/237076470/

New York

Introduction to Sendence Wallaroo: An Industrial-Grade Streaming Data Platform (New York) - Thursday, February 16

https://www.meetup.com/New-York-City-Storm-User-Group/events/237318240/

CANADA

Distributed Redundant Queueing with Apache Kafka (Kitchener) - Wednesday, February 15

https://www.meetup.com/Intersections-KW/events/236375299/

GERMANY

Apache Flink Meetup (Berlin) - Thursday, February 16

https://www.meetup.com/Apache-Flink-Meetup/events/236896351/

AUSTRIA

Hadoop User Group Meetup (Vienna) - Tuesday, February 14

https://www.meetup.com/Hadoop-User-Group-Vienna/events/236873308/

CZECH REPUBLIC

How It Works at Hortonworks (Prague) - Thursday, February 16

https://www.meetup.com/CS-HUG/events/237217644/

ISRAEL

Resilient Events Handling & Kappa Architecture (Herzeliyya) - Wednesday, February 15

https://www.meetup.com/Big-things-are-happening-here/events/237348726/

SOUTH AFRICA

Apache Kafka at Takealot.com: Use Cases and Production Considerations (Cape Town) - Wednesday, February 15

https://www.meetup.com/meetup-group-cxuwulGL/events/237157052/
