
Query an HBase table through Hive using PySpark on EMR


In this blog post, I’ll demonstrate how to access an HBase table through Hive from a PySpark script/job on an AWS EMR cluster. First, I created an EMR cluster (EMR 5.27.0, Hive 2.3.5, HBase 1.4.0). Then I connected to the master node, started “hbase shell”, created an HBase table, and inserted a sample row:

create 'mytable','f1'
put 'mytable', 'row1', 'f1:name', 'Gokhan'

I logged in to Hive and created a Hive table that points to the HBase table. In hbase.columns.mapping, :key binds the HBase row key to the rowkey column, and f1:name binds the name column of column family f1 to the name column:

CREATE EXTERNAL TABLE myhivetable (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f1:name')
TBLPROPERTIES ('hbase.table.name' = 'mytable');

When I tried to access the table using spark.table('myhivetable'), I got an error saying that the org.apache.hadoop.hive.hbase.HBaseStorageHandler class was not found. I tried the “--packages” parameter to fetch the required JAR library from the Maven repository; it downloaded a lot of missing JARs, but that did not work. So I downloaded the required JAR file using wget and copied it to Spark’s JAR directory:

wget https://repo1.maven.org/maven2/org/apache/hive/hive-hbase-handler/2.3.5/hive-hbase-handler-2.3.5.jar
sudo cp hive-hbase-handler-2.3.5.jar /usr/lib/spark/jars/

I noticed that the storage handler also requires some HBase JAR files, so I copied them into Spark’s JAR directory as well:

sudo cp /usr/lib/hbase/*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/htrace-core-3.1.0-incubating.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/guava-12.0.1.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hbase/lib/metrics-core-2.2.0.jar /usr/lib/spark/jars/
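At this point, a quick sanity check from the PySpark shell can confirm the handler class is visible. This is a minimal sketch of my own (not part of the original steps), and it only proves the class is on the driver’s classpath, not the executors’:

# Ask the JVM to load the storage handler class through Py4J; this raises
# an error wrapping ClassNotFoundException if the JAR is still missing.
spark._jvm.java.lang.Class.forName('org.apache.hadoop.hive.hbase.HBaseStorageHandler')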

After that, I tried to query the table and it worked:

>>> spark.table('myhivetable').show()
+------+------+
|rowkey|  name|
+------+------+
|  row1|Gokhan|
+------+------+
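For reference, here is the whole flow as a standalone PySpark script. It is a minimal sketch, assuming the JAR files above are already in place; the app name is my own choice, while the table name comes from the example:

from pyspark.sql import SparkSession

# enableHiveSupport() makes the Hive metastore visible to Spark,
# and with it the external table backed by the HBase table 'mytable'.
spark = SparkSession.builder \
    .appName('hbase-via-hive') \
    .enableHiveSupport() \
    .getOrCreate()

spark.table('myhivetable').show()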

I tested the same method on an earlier EMR version (5.12.x) and saw that it failed because the Spark executors tried to connect to local ZooKeeper instances (which do not exist on core/task nodes). If you hit such an error, you can set “hbase.zookeeper.quorum” to your master node’s IP address (where ZooKeeper runs):

sc._jsc.hadoopConfiguration().set('hbase.zookeeper.quorum','172.31.41.174')

The IP (172.31.41.174) is the private IP of my master node. You can find the private IP of your master node in the EMR console: it is shown in the master instance info on the Hardware page. You can also connect to the master node and run the following command:

nslookup `hostname`
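Alternatively, because Spark copies any property prefixed with “spark.hadoop.” into the Hadoop configuration, you can set the quorum while building the session instead of mutating the configuration afterwards. A minimal sketch, using the example IP from above:

from pyspark.sql import SparkSession

# 'spark.hadoop.*' properties are forwarded to hadoopConfiguration(),
# so this is equivalent to the set() call shown above.
spark = SparkSession.builder \
    .config('spark.hadoop.hbase.zookeeper.quorum', '172.31.41.174') \
    .enableHiveSupport() \
    .getOrCreate()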

I hope it helps. Please do not hesitate to let me know if you have any questions.


Viewing all articles
Browse latest Browse all 108

Trending Articles


FLASHBACK WITH SIRASA FM AT GALGAMUWA 2022


Mp3 Download: Mdu - Mazola


Imitation gun was fired at motorist in Leicester road-rage incident


Ndebele names


MCKINNEY EMALINE “EMMA” OF WES...


Okra & Motia — The Workshop (Prod by Hammer)


Skint TV teen to be sentenced


Moondru Mudichu 19-09-2017 – Polimer tv Serial


YOSVANI JAMES Arrested by Miami-Dade County Corrections on Jan 10, 2017


Stories • Goddess Stepmom