Updated Version of the Deployment Guide for Hadoop on VMware vSphere

[Category: Databases (General) | Posted by: 店小二03 | Date: 2017 | Author: 红领巾]

The new Deployment Guide for Virtualizing Hadoop on VMware vSphere describes the technical choices for running Hadoop and Spark-based applications in virtual machines on vSphere. Innovative technologies and design approaches are appearing very regularly in the big data market; the pace of innovation has not slowed down for sure!

A prime example of this innovation is the rapid growth in Spark adoption for serious enterprise work over the past year or so, overtaking MapReduce as the dominant way of building big data applications. Spark holds out the promise of faster application execution times and easier-to-use APIs for building your application. A lot of innovation work is now going into optimizing the streaming of large quantities of data into Spark, with an eye to the large data feeds that will appear from connected cars and other devices in the near future. This new version of the VMware Deployment Guide for Hadoop on vSphere brings the information up to date with developments in the Spark and YARN (“Yet Another Resource Negotiator”) areas.

YARN is the general name for the updated job scheduling and resource management functions that have now become mainstream in Hadoop deployments. MapReduce, once tied to the central resource management scheduler in Hadoop, is now relegated to just another programming framework. It is still used for Extract-Transform-Load (ETL) jobs, running in batch mode on a common resource management and scheduling platform (YARN), but to a large extent it is no longer the dominant paradigm for building applications. Spark is seen as much more suited to interactive queries and applications. Spark also runs as another application framework on YARN. That combination is popular in enterprises today, so it is the focus of much of our current testing, as you will see. Spark runs in standalone mode outside of the YARN resource manager context too, but that option is out of scope for the current Deployment Guide, as we see it less often within enterprises today. Of course, that may change in the future.
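To make the difference between the two modes concrete, here is a minimal sketch of how the same Spark job could be pointed at YARN or at a standalone Spark master. It only builds the `spark-submit` argument list and does not launch anything. The helper name `spark_submit_args`, the application file name, the standalone master URL and the executor sizes are all hypothetical values chosen for illustration; they do not come from the Deployment Guide.

```python
from typing import List


def spark_submit_args(
    app: str,
    on_yarn: bool = True,
    standalone_master: str = "spark://master-host:7077",  # hypothetical host
    deploy_mode: str = "cluster",
    num_executors: int = 4,       # illustrative value only
    executor_cores: int = 2,      # illustrative value only
    executor_memory: str = "4g",  # illustrative value only
) -> List[str]:
    """Build a spark-submit command line for YARN or standalone mode."""
    args = ["spark-submit"]
    if on_yarn:
        # YARN schedules the executors; --num-executors applies in this mode.
        args += ["--master", "yarn",
                 "--deploy-mode", deploy_mode,
                 "--num-executors", str(num_executors)]
    else:
        # Standalone mode talks to Spark's own master, outside of YARN.
        args += ["--master", standalone_master]
    args += ["--executor-cores", str(executor_cores),
             "--executor-memory", executor_memory,
             app]
    return args


if __name__ == "__main__":
    print(" ".join(spark_submit_args("etl_job.py")))
    print(" ".join(spark_submit_args("etl_job.py", on_yarn=False)))
```

The resulting list could be passed to `subprocess.run` on a machine where Spark is installed. Executor counts and sizes would in practice follow from the virtual machine sizing decisions discussed in the guide.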

The previous (2013) version of the Hadoop Deployment Guide for vSphere described the Hadoop 1.0 concepts (TaskTracker, JobTracker, etc.) as they are mapped into virtual machines. That earlier version also contained a wide set of technical choices for the core architecture decisions you need to make. In the new version, modern big data concepts such as Spark and YARN are described in a virtualization context.

In the new version, we brought the main design approaches down to two or three (for example choosing DAS or NAS in the storage area) and we extracted the more complicated designs and tool discussions from it, so as to make it more readable and more focused on getting you started. The ideas described here will scale up to hundreds of nodes if you so choose, so they can be used in the large scale too, if you are going that way. That is shown in the medium-size and large scale example deployments that are given in the guide.

You can think of this blog article as a quick shortcut to information in the Deployment Guide.

The main choices to be made at an early stage in considering the deployment of Hadoop on vSphere are given below.

These discussion points (apart from the VM sizing and placement ones) are not unique to virtualization and they apply equally in native systems too:

1. Having identified how much data our new systems will manage, an early question is what type of storage to use. This question can be answered in several ways. The Deployment Guide explores the use of Direct-Attached Storage (DAS), an external form of storage for HDFS, or a combination.
2. Whether to use an external storage mechanism (e.g. Isilon NAS) that removes the management of the HDFS data from the now “compute-only” nodes or virtual machines.
3. What Hadoop software/services to place into different types of virtual machines.
4. How to size and map the correct number of virtual machines onto the right number of vSphere host servers.
5. How to configure your networking so that the load that Hadoop occasionally places on it can be handled well.
6. How to handle and recover from failures and assure availability of your Hadoop clusters.

The set of questions related to data storage comes down to a core decision between dispersing your data out across multiple servers or retaining it on one central device. There are advantages to each of these.


The dispersed storage model (Option 1 above) allows you to use commodity servers and storage devices, but it means you have to manage it all using your own tools. If a drive or storage device fails in this scheme, it is the system administrator’s task to find it, fix it and restore it into the cluster. The centralized model ensures that all of your data is protected in one place, and it may cut down on your overall storage needs. This reduction comes from avoiding the replication factor that applies with DAS-based HDFS. It can also make the data easier to manage from an ingestion and multi-protocol point of view. The Deployment Guide shows that both of these models will work fine with vSphere, using somewhat different architectures.
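The effect of the replication factor on raw capacity can be illustrated with a little arithmetic. The sketch below assumes the HDFS default replication factor of 3 for the DAS case. For the centralized case it uses a purely hypothetical protection overhead of 1.25x, since the real figure depends on the storage product and its configuration. The 25% free-space headroom and the 100 TB dataset are likewise illustrative assumptions, not numbers from the Deployment Guide.

```python
def raw_capacity_tb(dataset_tb: float,
                    copies_factor: float,
                    headroom: float = 0.25) -> float:
    """Raw capacity needed to store dataset_tb of user data.

    copies_factor: how many times each byte is stored (3 for HDFS with
                   the default replication factor; an assumed overhead
                   for a centralized array).
    headroom:      fraction of raw capacity kept free (assumed value).
    """
    if dataset_tb < 0:
        raise ValueError("dataset_tb must be non-negative")
    if copies_factor < 1:
        raise ValueError("copies_factor must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return dataset_tb * copies_factor / (1 - headroom)


if __name__ == "__main__":
    data_tb = 100.0  # hypothetical dataset size
    das = raw_capacity_tb(data_tb, copies_factor=3)         # HDFS default
    central = raw_capacity_tb(data_tb, copies_factor=1.25)  # assumed overhead
    print(f"DAS-based HDFS: {das:.0f} TB raw")
    print(f"Centralized:    {central:.0f} TB raw")
```

The comparison covers capacity only. It says nothing about cost per terabyte, which usually differs between commodity disks and a central storage device.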

One other variant in storage is to use All-Flash storage on the servers in a similar fashion to DAS. This approach allows us to consider using Virtual SAN for hosting the entire Hadoop cluster, where earlier hybrid storage lent itself better to hosting the Hadoop Master nodes on the Virtual SAN-controlled storage. This All-Flash design for Hadoop on vSphere with VSAN is documented in a separate white paper from Intel and VMware.

Virtual Machine Placement

When making decisions about the placement of virtual machines onto servers, users have a distinct advantage in vSphere deployments. In many public clouds, we don’t typically know the server hardware configuration and the storage setup that our virtual machines will be deployed on. That anonymity is where the flexibility of the public cloud comes from. Correct VM placement onto host servers and storage is very important for Hadoop/Spark, however, as VM sizing and subsequent placement can have a profound influence on your application’s performance. That phenomenon is shown in the varied performance work that VMware has carried out on virtualized Hadoop, most recently in the testing of Spark and Machine Learning workloads on vSphere in particular. An example of the results from that work is given here.
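As a rough illustration of the sizing and mapping question, the sketch below works out how many worker virtual machines of a given shape fit on one host, and how many hosts a cluster of that size would need. It assumes no overcommitment of physical cores or memory and sets aside a fixed amount of memory for the hypervisor. The host and VM sizes, the 16 GB reservation and the names `Host`, `VM`, `vms_per_host` and `hosts_needed` are all hypothetical; the Deployment Guide's own recommendations should take precedence.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    cores: int       # physical cores on the vSphere host
    memory_gb: int   # physical memory on the host


@dataclass(frozen=True)
class VM:
    vcpus: int
    memory_gb: int


def vms_per_host(host: Host, vm: VM, reserved_memory_gb: int = 16) -> int:
    """VMs that fit on one host without overcommitting cores or memory."""
    by_cpu = host.cores // vm.vcpus
    by_memory = (host.memory_gb - reserved_memory_gb) // vm.memory_gb
    return max(0, min(by_cpu, by_memory))


def hosts_needed(worker_vms: int, host: Host, vm: VM,
                 reserved_memory_gb: int = 16) -> int:
    """Hosts required to place worker_vms virtual machines of shape vm."""
    per_host = vms_per_host(host, vm, reserved_memory_gb)
    if per_host == 0:
        raise ValueError("VM does not fit on the host")
    return math.ceil(worker_vms / per_host)


if __name__ == "__main__":
    host = Host(cores=32, memory_gb=256)  # hypothetical server
    worker = VM(vcpus=8, memory_gb=60)    # hypothetical worker VM
    print(vms_per_host(host, worker), "worker VMs per host")
    print(hosts_needed(10, host, worker), "hosts for 10 worker VMs")
```

A calculation like this covers only the compute side. Storage placement, as discussed earlier, needs to be planned alongside it.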

Other topics that are discussed in the Hadoop Deployment Guide are: system availability, networking, and big data best practices. There is also a set of example deployments at the small, medium and large-sized levels for Hadoop clusters. These are all in use either at VMware or at other organizations. You can start out with a small Hadoop cluster on vSphere and expand it upwards over time into the hundreds of servers, if needed.

There is a significant set of technical reference material also contained in the References section of the Hadoop on vSphere Deployment Guide that helps you delve into the deeper details on any of the topics covered in the guide. You can take one of the models described in the main text of the guide, or in the references section as your starting point for deployment and follow the guidelines from there. Using your Hadoop vendor’s deployment tool is recommended for your cluster, whether it be your first one or one a
