I thought I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I hadn’t mentioned Apache Hive, another important component of the big data ecosystem. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), and helps users apply their existing SQL skills to quickly derive value from big data.
Although recent improvements to the Hive project enable sub-second query retrieval (Hive LLAP), Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.
In this blog post, I’ll demonstrate how we can import data from CSV files into Hive tables, and run SQL queries to analyze the data stored in these tables.
We can use a third-party Hive client (such as Zeppelin, HUE, or DBeaver), the Ambari Hive view, or the Hive command line to connect to the Hive server. I’ll use a Zeppelin notebook, although it has some limitations when working with Hive: each block can run only one Hive SQL statement (you cannot run several queries separated by semicolons), and in one notebook you can run up to 10 queries.
If you have already read my previous blog posts about the Oracle Big Data Cloud Service – Compute Edition, you probably remember that I used a dataset about flights. This time, I’ll use a different dataset. I found the MovieLens dataset (from a movie recommendation service) on the net, and it’s much better than the one I used in my previous posts. It contains movie information and ratings/votes for each movie. It’s very clean and simple to understand.
First, I create a new Zeppelin notebook named Hive (its name doesn’t matter), and in the first code block, I write a simple bash script to download and unzip the MovieLens dataset. Here’s the code:
%sh
cd /tmp/
rm ml-latest.zip
rm -rf /tmp/ml-latest
wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip ml-latest.zip
In case you missed my blog post about Zeppelin, the first statement (%sh) indicates that this is a shell block, so Zeppelin uses the shell interpreter to run the content of the block. When we run the above block, ml-latest.zip will be downloaded and unzipped into the “/tmp/ml-latest” directory. We’ll only use “movies.csv” and “ratings.csv”.
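If you want to verify the download before moving on (this is just an optional check, not part of the original flow), you can list the extracted files in another shell block:

%sh
ls -l /tmp/ml-latest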
%hive DROP TABLE IF EXISTS ratings
%hive DROP TABLE IF EXISTS movies
We need to create the movies and ratings tables in Hive, but first we need to drop them if they already exist. You can skip this step, but it’s useful to include the above blocks so our notebook can be re-run without any errors.
%hive
CREATE TABLE ratings (
  userId DECIMAL,
  movieId DECIMAL,
  rating DECIMAL,
  rating_timestamp BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1")
%hive
CREATE TABLE movies (
  movieId DECIMAL,
  title STRING,
  genres STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1")
The above blocks will create the Hive tables to hold the movies and ratings data. As you can see, the ratings table has 4 columns (userId, movieId, rating, rating_timestamp) and the movies table has 3 columns (movieId, title, genres). “skip.header.line.count” lets us skip the first line (the header line of the CSV).
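If you want to double-check a table definition, including the “skip.header.line.count” property, you can describe the table in a separate block (an optional verification step):

%hive
DESCRIBE FORMATTED ratings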
We have various options for importing data into these tables. For example, we can use Pig to import the data. We could also create these tables as “external tables” pointing to the CSV files, but I’ll use Hive’s native LOAD command to import the data.
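For reference, here’s a minimal sketch of the external-table alternative, assuming the CSV files were uploaded to a hypothetical HDFS directory such as /user/zeppelin/movielens/ratings (the directory name and table name are just examples):

%hive
CREATE EXTERNAL TABLE ratings_ext (
  userId DECIMAL,
  movieId DECIMAL,
  rating DECIMAL,
  rating_timestamp BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/user/zeppelin/movielens/ratings'
TBLPROPERTIES ("skip.header.line.count"="1")

With an external table, Hive reads the CSV files in place, and dropping the table does not delete the underlying data.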
%hive LOAD DATA LOCAL INPATH '/tmp/ml-latest/ratings.csv' OVERWRITE INTO TABLE ratings
%hive LOAD DATA LOCAL INPATH '/tmp/ml-latest/movies.csv' OVERWRITE INTO TABLE movies
The above two blocks import the data into the Hive tables. The LOCAL keyword tells Hive that we are reading from a local directory on the server; if we want to read from an HDFS directory, we remove it. The OVERWRITE keyword tells Hive to replace the existing data; if we don’t use it, the new data is appended to the existing data.
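For example, if the ratings file had already been copied to an HDFS directory (say, a hypothetical /user/zeppelin/ml-latest), the load would look like this; without LOCAL, Hive reads (and moves) the file from HDFS, and without OVERWRITE it appends to the existing data:

%hive
LOAD DATA INPATH '/user/zeppelin/ml-latest/ratings.csv' INTO TABLE ratings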
By the way, Hive blocks are special blocks, like Spark SQL blocks: their output is shown as a data grid, and they come with buttons to generate graphs. However, the statements we just executed do not return a result set, so the output won’t be fancy; all we see is an update count of -1. Ignore this -1 result, it’s not an error code. If you hit a real error, Hive will return a clear error message.
Now that we have imported the data into our Hive tables, we can use traditional SQL queries to fetch data from them. I tried to find the top 10 “Thriller” movies that were rated by at least 20,000 users. Here’s the query:
%hive
SELECT m.title, AVG(r.rating) avg_rating, COUNT(*) total_votes
FROM movies AS m, ratings AS r
WHERE m.movieid = r.movieid AND m.genres LIKE '%Thriller%'
GROUP BY m.movieid, m.title
HAVING COUNT(*) >= 20000
ORDER BY avg_rating DESC
LIMIT 10
Have you noticed that there is a problem with the average ratings? All of them are shown as “4”. At first I thought it was a problem with the data type of the rating column, but it seems to be just a bug in the current Zeppelin data grid.
I logged in to the cloud server, switched to the hive user, and issued the “hive” command to start the Hive command line. When I ran the above query from the Hive command line, I got the correct output.
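Roughly, the steps look like this (user names and access details may differ on your cluster):

# connect to the cloud server over SSH, then:
sudo su - hive     # switch to the hive user
hive               # start the Hive command line
# at the hive> prompt, paste the same SELECT ... LIMIT 10 query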
As you can see, Hive is a great tool for traditional SQL users who want to work with big data, and it’s ready to use with the Oracle Big Data Cloud Service – Compute Edition.