
Update an HBase table with Hive… or sed

[Category: Databases (general) | Published 2016]

This article explains how to edit a large number of structured records in HBase by combining Hive on M/R2 and sed, and how to do it properly with Hive alone.

HBase is an impressive piece of software: massively scalable, super-fast and built upon Google’s BigTable… and if anyone knows “How To Big Data”, it ought to be these guys.

But whether you are new to HBase or just never felt the need, working with its internals tends not to be the most trivial task. The HBase shell is only slightly better in terms of usability than the infamous zkCli (ZooKeeper Command Line Interface), and once it comes to editing records, you’ll wind up with tedious, long and complex statements before you revert to using its (Java) API to fulfill your needs, if the mighty Kerberos doesn’t stop you first.

But what if… there were a way that is both stupid and functional? What if it could be done without a programmer? Ladies and gentlemen, meet sed.

sed (stream editor) is a Unix utility that parses and transforms text, using a simple, compact programming language.

https://en.wikipedia.org/wiki/Sed
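To see why sed is tempting for a job like this, here is a toy sketch, assuming (hypothetically) that the records have been exported as CSV with the client id in the last column: an off-by-one can be undone with ordered substitutions, bumping the highest value first so no value gets incremented twice.

```shell
# Hypothetical CSV export: key,user,action,ts,client (client is one too low)
printf '142_1472676242,142,34,1472676242,0\n999_1472683262,999,1,1472683262,1\n' |
  sed -e 's/,2$/,3/' -e 's/,1$/,2/' -e 's/,0$/,1/'
# prints:
# 142_1472676242,142,34,1472676242,1
# 999_1472683262,999,1,1472683262,2
```

The -e expressions run in order on each line, so replacing 2 before 1 before 0 guarantees a freshly bumped value is never matched a second time.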

The scenario

Let’s say you have a table with customer records on HBase, logging all users’ interactions on your platform (your online store, your bookkeeping, basically anything that can collect metrics):

key             cf:user  cf:action  cf:ts       cf:client
142_1472676242  142      34         1472676242  1
999_1472683262  999      1          1472683262  2

Where: cf:user is the user’s id, cf:action maps to a page on your system (say 1 equals login, 34 equals a specific screen), cf:ts is the interaction’s timestamp and cf:client is the client, say 1 for OS X, 2 for Windows, 3 for Linux.

Let’s ignore the redundancy of timestamp and user (both also appear in the key) and the usability of the key itself; let’s stick to a simple design for this.

One of your engineers (surely not you!) f@#!ed up the mapping for cf:client between 2016-08-22 and 2016-08-24: every record’s “client” value is one off! 1 is now 0 (making me the user of an NaN OS), 2 is now 1 and maps to OS X instead of Windows, and suddenly, Linux is off the charts. But don’t panic, Hive to the rescue!

Set the stage

Here’s what we’ll use:

Hadoop 2.6.0 (HDFS + YARN)
HBase 1.2.2
Hive 2.1.0
A custom create script for the data
OS X 10.11.6

And here’s what we’re going to do: First, we’ll create our HBase table.

hbase shell
create 'action', 'cf', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

Bam, good to go! The table’s empty but we’ll get to that in a second.

In order to do anything with our table, we need data. I’ve written a small Spark app that you can get here and run as such:

git clone ...
mvn clean install
cd target
spark-submit \
--class com.otterinasuit.spark.datagenerator.Main \
--master local[*] \
--deploy-mode client \
CreateTestHBaseRecords-1.0-SNAPSHOT.jar \
2000000 \
4
# arguments: 2000000 = number of records, 4 = threads

Let’s run it with a charming 2,000,000 records and see where that goes. On my mid-2014 MacBook Pro with an i5 and 8GB of RAM, the job ran for 9.8 minutes. You might want to speed that up with a cluster of your choice!

If you want to check the results, you can use good ol’ MapReduce:

hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'action'

Alternatively:

hbase shell
count 'action', INTERVAL=>100000

Granted, you would not have needed Spark for this, but Spark is awesome! Also, its overhead in relation to its performance benefits beats shell or Python scripts doing the same job in many daily use cases (surely not when running on a tiny laptop, but that’s not the point). I was forced to use some semi-smart approaches (Spark MLlib for random data and changing the type with a map() function, for one) but hey, it works!

Next, an external Hive table, the baseline for all our Hive interactions with HBase.

CREATE DATABASE IF NOT EXISTS test;
CREATE EXTERNAL TABLE test.action(
key string
, userf int
, action int
, ts bigint
, client int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "
:key
, cf:user#b
, cf:action#b
, cf:ts#b
, cf:client#b
"
)
TBLPROPERTIES("hbase.table.name" = "action");

Unless you are trying to abuse it (like we’re about to do!), Hive is a really awesome tool. Its SQL-like query language, HQL, comes really naturally to most developers, and it can be used with a big variety of sources and for an even bigger variety of use cases. Hive is, however, a tool that operates on top of MapReduce2 by default and, with that, comes with a notable performance overhead. So do yourself a favor and have a look at Impala, Tez and Drill, or run Hive on Spark. But let’s abuse Hive with M/R2 a bit more first.

Hive magic

We now have an HBase table, a Hive external table and data. Let’s translate our problem from before to HQL. In the statement, 1471824000 equals 2016-08-22 (midnight) and 1472083199 is 2016-08-24 (23:59:59).

UPDATE test.action SET client = client+1 WHERE ts >= 1471824000 AND ts <= 1472083199;
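Those epoch boundaries are easy to sanity-check from the shell (GNU date shown; on OS X you’d use `date -u -r <epoch>` instead):

```shell
date -u -d @1471824000 +'%Y-%m-%d %H:%M:%S'   # 2016-08-22 00:00:00
date -u -d @1472083199 +'%Y-%m-%d %H:%M:%S'   # 2016-08-24 23:59:59
```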

But what’s that?

FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

The reason why this doesn’t work is simple:

If a table is to be used in ACID writes (insert, update, delete) then the table property “transactional” must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

Apart from setting a table property, we would also need to edit the Hive configuration and change the default values of hive.support.concurrency to true and hive.exec.dynamic.partition.mode to nonstrict. That’s not only something that would usually involve several people (not good for our unlucky engineer), it also requires the table to be stored in ORC format, which is not the case for us.
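For reference, a sketch of what the full ACID setup would look like (settings per the Hive transactions documentation; the table name is hypothetical), not something we will do here:

```sql
-- session / hive-site.xml settings required for ACID writes
SET hive.support.concurrency=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

-- and the target table must be native (not an HBase-backed external table),
-- stored as ORC, bucketed, and flagged as transactional
CREATE TABLE test.action_acid (key string, userf int, action int, ts bigint, client int)
CLUSTERED BY (key) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```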

But one thing Hive can also do is to create managed tables . A managed table (instead of an external t

Source: http://www.codesec.net/view/484333.html