So far I’ve written articles on Google BigQuery ( 1 , 2 , 3 , 4 , 5 ), on cloud-native economics( 1 , 2 ), and even on ephemeral VMs ( 1 ). One product that really excites me is Google Cloud Dataproc ― Google’s managed Hadoop, Spark, and Flink offering. In what seems to be a fully commoditized market at first glance, Dataproc manages to create significant differentiated value that bodes to transform how folks think about their Hadoop workloads.

Jobs-first Hadoop+Spark, not Clusters-first

Typical mode of operation of Hadoop ― on premise or in cloud ― require you deploy a cluster, and then you proceed to fill up said cluster with jobs, be it MapReduce jobs, Hive queries, SparkSQL, etc. Pretty straightforward stuff.

Why Dataproc ― Google’s managed Hadoop and Spark offering is a game changer.
The standard way of running Hadoop andSpark.

Services like Amazon EMR go a step further and let you run ephemeral clusters, enabled by separation of storage and compute through EMRFS and S3. This means that you can discard your cluster while keeping state on S3 after the workload is completed.

Google Cloud Platform has two critical differentiating characteristics:

Per-minute billing (Azure has this as well) Very fast VM boot up times

When your clusters start in well under 90 seconds (under 60 seconds is not unusual), and when you do not have to worry about wasting that hard-earned cash on your cloud provider’s pricing inefficiencies, you can flip this cluster->jobs equation on its head. You start with a job, and you acquire a cluster as a step in job execution.

If you have a MapReduce job, as long as you’re okay with paying the 60 second initial boot-up tax, rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on-demand. A cluster is now a means to an end for job execution.

Why Dataproc ― Google’s managed Hadoop and Spark offering is a game changer.
Demonstration of my exquisite art skills, plus illustration of the jobs before clusters concept realized with Dataproc.

Again, this is only possible with Google Dataproc, only because of:

high granularity of billing (per-minute) very low tax on initial boot-up times separation of storage and compute (and ditching HDFS as primary store).

Operational and economic benefits are obvious and easily realized:

Resource segregation though tenancy segregation avoids non-obvious bottlenecks and resource contention between jobs. Simplicity of management ― no need to actually manage the cluster or resource allocation and priorities through things like YARN resource manager. Your dev/stage/prod workloads are now intrinsically separate ― and what a pain that is to resolve and manage elsewhere! Simplicity of pricing ― no need to worry about rounding up to nearest hour. Simplicity of cluster sizing ― to get the job done faster, simply ask Dataproc to deploy more resources for the job. When you pay per-minute, you can start thinking in terms of VM-minutes. Simplicity of troubleshooting ― resources are isolated, so you can’t blame the other tenants on your problems.

I’m sure I’m forgetting others. Feel free to leave a comment here to add color. Best response gets a collectors’ edition Google Cloud Android figurine!

Dataproc is as close as you can get to serverless and cloud-native pay-per-job with VM-based architectures ― across the entire cloud space. There’s nothing even close to it in that regard.

Dataproc does have a 10-minute minimum for pricing. Add the sub-90 second cluster creation timer, and you rule out many relatively lightweight ad-hoc workloads. In other words, this works for big serious batch jobs, not ad-hoc SQL queries that you want to run in under 10 seconds. I write on this topic here .(do let us know if you have a compelling use case that leaves you asking for less than a 10-minute minimum).

The rest of the Dataproc goodies

Google Cloud doesn’t stop there. There’s a few other benefits of Dataproc that truly make your life easier and your pockets fuller:

Custom VMs ― if you know the typical resource utilization profile of your job in terms of CPU/RAM, you can tailor-make your own instances with that CPU/RAM profile. This is really really cool, you guys. Preemptible VMs ― I wrote on this topic recently. Google’s alternative to Spot instances is just great. Flat 80% off, and Dataproc is smart enough to repair your jobs in case instances go away. I beat this topic to death in the blog post , and in my biased opinion it’s worth a read on its own. Best pricing in town. Google Compute Engine is the industry price leader for comparably-sized VMs. In some cases, up t0 40% less than EC2. Gobs of ephemeral capacity ― Yes, you can run your Spark jobs on thousands of Preemptible VMs, and we won’t make you sign a big commitment, as this gentleman found out (TL;DR: running 25,000 Preemptible VMs). GCS is fast fast fast ― When ditching HDFS in favor of object stores, what matters is the overall pipe between storage and instances. Mr. Jornson details performance characteristics of GCS and comparable offerings here . Dataproc for stateful clusters Now if you are running a stateful cluster with, say Impala and Hbase on HDFS,


主题: SparkHadoopHDFSSQLCPUMapReduceAndroidHive
本文标题:Why Dataproc ― Google’s managed Hadoop and Spark offering is a game changer.

技术大类 技术大类 | 数据库(综合) | 评论(0) | 阅读(25)