
Hadoop Weekly Issue #199

[Category: Databases (General) | Posted by: 店小二03 | Date: 2017 | Author: 红领巾]

08 January 2017

This week's edition is relatively short, but it contains some great posts on Hadoop+S3, Hadoop+Python, first-class asynchronous processing with Samza, and two new tools for Kafka and ZooKeeper. There are also a couple of year-end and year-ahead posts from Datanami and Databricks, plus a great interview with one of the authors of Cassandra: The Definitive Guide.

Technical

This post has a list of important settings and practices for using Amazon S3 with Apache Hadoop and Apache Spark. Following these tips should improve performance and work around some of the problems associated with S3's blob store semantics.

https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
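As a rough illustration of what such settings look like in practice, the sketch below collects a few Spark and Hadoop properties that are often recommended when writing to S3 and renders them as `spark-submit` flags. The property names are real Spark/Hadoop/Parquet options, but the selection and the values are assumptions for illustration, not the post's exact list; `S3_FRIENDLY_CONF` and `to_spark_submit_args` are hypothetical names.

```python
# Illustrative only: which settings to use, and their values, are assumptions.
S3_FRIENDLY_CONF = {
    # Version 2 moves task output to the final location at task commit,
    # so job commit does not need a second round of (slow on S3) renames.
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    # Speculative duplicate tasks are risky when the output store lacks atomic rename.
    "spark.speculation": "false",
    # Skip writing Parquet summary metadata files at the end of a job.
    "spark.hadoop.parquet.enable.summary-metadata": "false",
    # Avoid reading every file footer to merge schemas.
    "spark.sql.parquet.mergeSchema": "false",
    # Push filters down to the Parquet reader to cut the bytes fetched.
    "spark.sql.parquet.filterPushdown": "true",
}


def to_spark_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(S3_FRIENDLY_CONF)))
```

Keeping the settings in one dict makes it easy to review them against the post and to drop any that do not apply to a given Hadoop or Spark version.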

Python data tool support for interacting with Hadoop and the larger ecosystem improved drastically in 2016. The first post describes the strides made (and plans for 2017) with Apache Arrow, Apache Parquet, the Feather file format, PySpark, and Ibis. The second post looks at the performance and maturity of several Python libraries for reading data from HDFS.

http://wesmckinney.com/blog/outlook-for-2017/

http://wesmckinney.com/blog/python-hdfs-interfaces/

MapR has the second part of a blog post that walks through using Spark's k-means machine learning algorithm to do real-time clustering of Uber data. The first post focussed on model creation, and this post adds a Spark Streaming job to apply the classifications and then a second job to produce a dataframe for analysis using Spark SQL.

https://www.mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2
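The step of applying a trained k-means model to incoming records boils down to finding the nearest cluster center for each point. The sketch below shows that idea in plain Python rather than Spark MLlib or Spark Streaming; the cluster centers and function names are hypothetical and are not taken from the MapR post.

```python
import math

# Hypothetical (latitude, longitude) cluster centers, standing in for a trained model.
CENTERS = [(40.75, -73.99), (40.65, -73.78)]


def nearest_cluster(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def assign_stream(points, centers):
    """Lazily tag each incoming point with the index of its nearest center."""
    for point in points:
        yield point, nearest_cluster(point, centers)


if __name__ == "__main__":
    incoming = [(40.76, -73.98), (40.64, -73.79)]
    for point, cluster in assign_stream(incoming, CENTERS):
        print(point, "-> cluster", cluster)
```

In a streaming job the same assignment would run once per micro-batch or per event, with the centers loaded from the saved model.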

While much of the recent stream processing news has focussed on Spark, Flink, and Kafka, the Apache Samza project continues to be used by LinkedIn and other companies. Whereas those other stream processing systems simplify the programming model to be synchronous and stream/event-based, Samza is experimenting with a different, asynchronous model. In it, callbacks are used to efficiently support RPCs and other asynchronous operations. To support these semantics, Samza has implemented an event loop, which is described in this post on LinkedIn's engineering blog.

https://engineering.linkedin.com/blog/2017/01/asynchronous-processing-and-multithreading-in-apache-samza--part
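The general pattern (start a remote call per message, register a callback, and let an event loop keep other messages moving while calls are in flight) can be sketched with Python's `asyncio`. This is not the Samza API, which is Java-based; `fake_remote_lookup` and `process_all` are hypothetical names, and the sleep merely simulates RPC latency.

```python
import asyncio
import functools


async def fake_remote_lookup(key):
    """Stand-in for an RPC; the short sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return key.upper()


async def process_all(messages):
    """Start one lookup per message and collect results via completion callbacks."""
    results = {}

    def on_complete(message, task):
        # Invoked by the event loop once the lookup for this message finishes.
        results[message] = task.result()

    tasks = []
    for message in messages:
        task = asyncio.create_task(fake_remote_lookup(message))
        task.add_done_callback(functools.partial(on_complete, message))
        tasks.append(task)

    # All lookups are in flight concurrently; wait for them to finish.
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(process_all(["a", "b", "c"])))
```

Because the lookups overlap, total latency stays close to that of a single call instead of growing with the number of messages, which is the efficiency argument for an asynchronous model.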

MapR has a tutorial that describes using Spark to run images through the Tesseract open-source OCR engine and store the parsed text in an Elasticsearch index.

https://www.mapr.com/blog/processing-image-documents-mapr-scale

The Morning Paper is going to be covering some great distributed systems papers (including Apache Hadoop YARN) this week. In preparation, this post has links to several pieces of background reading from previous posts.

https://blog.acolyer.org/2017/01/08/a-distributed-systems-seminar-reading-list-spring-2017-edition/

News

Datanami has an article containing 2017 outlooks from a number of big data industry executives. There is quite a variety of opinions, including that Hadoop will take off (and die off) and that 2017 is the year that BI analytics will finally deliver.

https://www.datanami.com/2017/01/04/2017-more-big-data-predictions/

The Databricks blog summarizes some of the major accomplishments and milestones that Spark and Databricks hit in 2016. These include support for SQL:2003, the CloudSort record, and Structured Streaming.

https://databricks.com/blog/2017/01/04/databricks-and-apache-spark-year-in-review.html

Confluent has their monthly Log Compaction newsletter that includes coverage of current Kafka Improvement Proposals (including proposals for global tables and single message transformations in Kafka Connect) and several hand-picked articles and presentations.

https://www.confluent.io/blog/log-compaction-highlights-apache-kafka-stream-processing-community-january-2017/

This post describes how Google Cloud Platform's per-minute billing and fast boot times allow you to build a job-first data pipeline, rather than a cluster-first one. While other cloud vendors offer similar setups (Amazon EMR is the most notable one), this article highlights some of the competitive advantages (e.g. fast SSDs, cheap preemptible VMs) that Google offers.

https://hackernoon.com/why-dataproc-googles-managed-hadoop-and-spark-offering-is-a-game-changer-9f0ed183fda3
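To see why billing granularity matters for a job-first design, the sketch below rounds a job's runtime up to the next billing increment and compares the resulting cost. The node count, price, and runtime are made-up numbers, and provider-specific details such as minimum charges are deliberately left out.

```python
def billed_minutes(runtime_minutes, granularity_minutes):
    """Round a runtime up to the next multiple of the billing increment."""
    increments = -(-runtime_minutes // granularity_minutes)  # integer ceiling division
    return increments * granularity_minutes


def job_cost(runtime_minutes, granularity_minutes, nodes, price_per_node_hour):
    """Cost of a cluster that exists only for the duration of one job."""
    hours = billed_minutes(runtime_minutes, granularity_minutes) / 60
    return hours * nodes * price_per_node_hour


if __name__ == "__main__":
    # Hypothetical job: 25 minutes on 10 nodes at 0.50 per node-hour.
    print("per-minute billing:", job_cost(25, 1, 10, 0.50))
    print("per-hour billing:  ", job_cost(25, 60, 10, 0.50))
```

With coarse increments, short jobs pay for idle time, which pushes users toward long-lived shared clusters; fine-grained billing removes that penalty.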

After six years, there's a new edition of Cassandra: The Definitive Guide. InfoQ has an interview with the book's co-author Jeff Carpenter about what's new in the book (it covers up through Cassandra 3.0), some of the new features in recent Cassandra releases, Cassandra's multi-datacenter support, integration with Spark and other ecosystem projects, and more.

https://www.infoq.com/articles/cassandra-2nd-edition-book-review

The Call For Papers for Kafka Summit New York, which takes place in May, closes in just over a week. The conference tracks are Systems, Streaming Data Pipelines, and Stream Processing.

https://kafka-summit.org/kafka-summit-ny/speakers/

Releases

For Apache Kafka cluster operations, this project provides a script to analyze cluster state to determine which brokers may be responsible for under-replicated partitions.

https://github.com/wushujames/kafka-utilities
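The underlying idea can be sketched without the project's script: a partition is under-replicated when its in-sync replica (ISR) set is smaller than its assigned replica set, and a broker that is missing from many ISRs is a likely culprit. The metadata layout and sample data below are hypothetical and do not reproduce the kafka-utilities code.

```python
from collections import Counter

# Hypothetical partition metadata: assigned replicas and in-sync replicas per partition.
PARTITIONS = [
    {"topic": "events", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "events", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 1]},
    {"topic": "events", "partition": 2, "replicas": [3, 1, 2], "isr": [1, 2]},
]


def missing_replica_counts(partitions):
    """Count, per broker, the partitions it is assigned to but not in sync for."""
    counts = Counter()
    for partition in partitions:
        for broker in set(partition["replicas"]) - set(partition["isr"]):
            counts[broker] += 1
    return counts


if __name__ == "__main__":
    for broker, count in missing_replica_counts(PARTITIONS).most_common():
        print(f"broker {broker} is out of sync for {count} partition(s)")
```

Ranking brokers by this count gives a quick first guess at where to look before digging into broker logs.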

Burry is a new tool for performing backups (and restores) of Apache ZooKeeper, etcd, and Consul to local storage, blob storage (such as Amazon S3), and more.

https://github.com/mhausenblas/burry.sh

Events

Curated by Datadog ( http://www.datadog.com )

UNITED STATES

California

Spark SQL: 10 Things You Need to Know (San Diego) - Tuesday, January 10

https://www.meetup.com/San-Diego-Spark-and-Big-Data-Meetup/events/232713087/

Apache Spark Meetup @ Workday (San Francisco) - Tuesday, January 10

https://www.meetup.com/spark-users/events/236220442/

DevOps for Data Science: Lifecycle of Big Data Analytics Services (San Francisco) - Wednesday, January 11

https://www.meetup.com/SF-Big-Analytics/events/235275403/

Airflow Meetup 1Q17 (San Francisco) - Wednesday, January 11

https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/235259523/

2017 Kickoff: Cloudera Lightning Talks (Palo Alto) - Wednesday, January 11

https://www.meetup.com/SFBay-Lucene-Solr-Meetup/events/236218287/

Tech Talk: Processing IoT Data with Apache Kafka (Mountain View) - Thursday, January 12

https://www.meetup.com/openvswitch/events/235615014/

Texas

The Apache Solr Smart Data Ecosystem (Plano) - Monday, January 9

https://www.meetup.com/DFW-Data-Science/events/235387179/

A Brief Introduction to Scala (San Antonio) - Tuesday, January 10

https://www.meetup.com/San-Antonio-Data-Science-Meetup/events/236191503/

Minnesota

Join Doug Cutting, the Creator of Hadoop, for Apache Hadoop: The Next 10 Years (Saint Paul) - Monday, January 9

https://www.meetup.com/Twin-Cities-Hadoop-User-Group/events/235740879/

Michigan

Lambda Architecture and Data Mining! (Grand Rapids) - Wednesday, January 11

https://www.meetup.com/Big-Data-and-Hadoop-Users-Group-of-West-Michigan/events/235787945/

Virginia

Introduction to Kafka Streams with a Real-Life Example (Tysons) - Wednesday, January 11

https://www.meetup.com/Apache-Kafka-DC/events/236376949/

MEXICO

Data as a Log + Asana Live Demo (Zapopan) - Wednesday, January 11

https://www.meetup.com/gdljug/events/236642157/

SPAIN

Splice Machine: Architecture of an Open Source RDBMS Powered by HBase and Spark (Barcelona) - Thursday, January 12

https://www.meetup.com/Spark-Barcelona/events/236569889/
