
Query an HBase table through Hive using PySpark on EMR


In this blog post, I’ll demonstrate how to access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, started “hbase shell”, created an HBase table, and inserted a sample row:

create 'mytable','f1'
put 'mytable', 'row1', 'f1:name', 'Gokhan'

I logged in to Hive and created a Hive table that points to the HBase table. In hbase.columns.mapping, :key binds the HBase row key to the rowkey column, and f1:name binds the name column of column family f1 to the name column:

CREATE EXTERNAL TABLE myhivetable (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f1:name')
TBLPROPERTIES ('hbase.table.name' = 'mytable');

When I tried to access the table using spark.table('myhivetable'), I got an error saying that the org.apache.hadoop.hive.hbase.HBaseStorageHandler class was not found. I tried the “--packages” parameter to fetch the required JAR library from the Maven repository; it downloaded a lot of missing JARs, but that did not work. So I downloaded the required JAR file using wget and copied it to Spark’s JAR directory:

wget https://repo1.maven.org/maven2/org/apache/hive/hive-hbase-handler/2.3.5/hive-hbase-handler-2.3.5.jar
sudo cp hive-hbase-handler-2.3.5.jar /usr/lib/spark/jars/

I noticed that the storage handler also requires some HBase JAR files, so I copied them into Spark’s JAR directory as well:

sudo cp /usr/lib/hbase/*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/htrace-core-3.1.0-incubating.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/guava-12.0.1.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/metrics-core-2.2.0.jar /usr/lib/spark/jars/
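At this point, a quick sanity check from the PySpark shell can confirm the handler class is visible. This is a minimal sketch of my own (not part of the original steps), and it only proves the class is on the driver’s classpath, not the executors’:

# Ask the JVM to load the storage handler class through Py4J; this raises
# an error wrapping ClassNotFoundException if the JAR is still missing.
spark._jvm.java.lang.Class.forName('org.apache.hadoop.hive.hbase.HBaseStorageHandler')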

After that, I tried to query the table and it worked:

>>> spark.table('myhivetable').show()
+------+------+
|rowkey|  name|
+------+------+
|  row1|Gokhan|
+------+------+
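For reference, here is the whole flow as a standalone PySpark script. It is a minimal sketch, assuming the JAR files above are already in place; the app name is my own choice, while the table name comes from the example:

from pyspark.sql import SparkSession

# enableHiveSupport() makes the Hive metastore visible to Spark,
# and with it the external table backed by the HBase table 'mytable'.
spark = SparkSession.builder \
    .appName('hbase-via-hive') \
    .enableHiveSupport() \
    .getOrCreate()

spark.table('myhivetable').show()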

I tested the same method on an earlier EMR version (5.12.x) and saw that it failed because the Spark executors tried to connect to local ZooKeeper instances (which do not exist on core/task nodes). If you hit such an error, you can set “hbase.zookeeper.quorum” to your master node’s IP address (where ZooKeeper runs):

sc._jsc.hadoopConfiguration().set('hbase.zookeeper.quorum','172.31.41.174')

The IP (172.31.41.174) is the private IP of my master node. You can find the private IP of your master node in the EMR console: it is shown in the master instance info on the Hardware page. You can also connect to the master node and run the following command:

nslookup `hostname`
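Alternatively, because Spark copies any property prefixed with “spark.hadoop.” into the Hadoop configuration, you can set the quorum while building the session instead of mutating the configuration afterwards. A minimal sketch, using the example IP from above:

from pyspark.sql import SparkSession

# 'spark.hadoop.*' properties are forwarded to hadoopConfiguration(),
# so this is equivalent to the set() call shown above.
spark = SparkSession.builder \
    .config('spark.hadoop.hbase.zookeeper.quorum', '172.31.41.174') \
    .enableHiveSupport() \
    .getOrCreate()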

I hope it helps. Please do not hesitate to let me know if you have any questions.


Viewing all articles
Browse latest Browse all 108

Trending Articles


FLASHBACK WITH SIRASA FM AT GALGAMUWA 2022


Mp3 Download: Mdu - Mazola


Imitation gun was fired at motorist in Leicester road-rage incident


Ndebele names


MCKINNEY EMALINE “EMMA” OF WES...


Okra & Motia — The Workshop (Prod by Hammer)


Skint TV teen to be sentenced


Moondru Mudichu 19-09-2017 – Polimer tv Serial


YOSVANI JAMES Arrested by Miami-Dade County Corrections on Jan 10, 2017


Stories • Goddess Stepmom