
Introduction to Oracle Big Data Cloud Service – Compute Edition (Part VI) – Hive


I thought I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I hadn’t mentioned Apache Hive, another important component of the big data ecosystem. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), helping users apply their existing SQL skills to quickly derive value from big data.

Although recent improvements to the Hive project enable sub-second query retrieval (Hive LLAP), Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.

In this blog post, I’ll demonstrate how we can import data from CSV files into Hive tables, and run SQL queries to analyze the data stored in these tables.

We can use a third-party Hive client (such as Zeppelin, Hue or DBeaver), the Ambari Hive view, or the Hive command line to connect to the Hive server. I’ll use a Zeppelin notebook, although it has some limitations when running with Hive: each block can run only one Hive SQL statement (you cannot run several queries separated by semicolons), and in one notebook you can run up to 10 queries.
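If you prefer the command line, Beeline (the JDBC-based Hive client) can also connect to the Hive server directly. This is just a minimal sketch; the hostname, the default HiveServer2 port 10000, and the user name are assumptions that depend on your cluster configuration:

# hostname, port and user below are placeholders; adjust for your cluster
beeline -u jdbc:hive2://localhost:10000 -n hive

Once connected, you can run standard HiveQL statements at the prompt.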

If you have already read my previous blog posts about the Oracle Big Data Cloud Service – Compute Edition, you probably remember that I used a dataset about flights. This time, I’ll use a different one. I found the MovieLens database (from a movie recommendation service) on the net, and it’s much better than the one I used in my previous posts. It contains movie information and ratings/votes for each movie, and it’s very clean and simple to understand.

First, I create a new Zeppelin notebook named Hive (its name doesn’t matter), and in the first code block, I write a simple bash script to download and unzip the MovieLens database. Here’s the code:

%sh
cd /tmp/
rm -f ml-latest.zip
rm -rf /tmp/ml-latest
wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip ml-latest.zip

In case you missed my blog post about Zeppelin, the first statement (%sh) indicates that this is a shell block, so Zeppelin uses the shell interpreter to run the content of the block. When we run the above block, ml-latest.zip will be downloaded and unzipped into the “/tmp/ml-latest” directory. We’ll only use “movies.csv” and “ratings.csv”.
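Before creating any tables, we can peek at the downloaded files in another shell block. This is just an optional sanity check; the first line of each CSV file is the header we will skip later:

%sh
ls -l /tmp/ml-latest
head -2 /tmp/ml-latest/movies.csv
head -2 /tmp/ml-latest/ratings.csv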

%hive
DROP TABLE IF EXISTS ratings

%hive
DROP TABLE IF EXISTS movies

We need to create the movies and ratings tables in Hive, but first we need to drop them if they already exist. You can skip this step, but including the above blocks lets our notebook be re-run without any errors.

%hive
CREATE TABLE ratings ( userId DECIMAL, movieId DECIMAL, rating DECIMAL, rating_timestamp BIGINT ) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES("skip.header.line.count"="1")

%hive
CREATE TABLE movies ( movieId DECIMAL, title STRING, genres STRING ) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES("skip.header.line.count"="1")

The above blocks will create the Hive tables to hold the movies and ratings data. As you can see, the ratings table has 4 columns (userId, movieId, rating, rating_timestamp) and the movies table has 3 columns (movieId, title, genres). The “skip.header.line.count” property lets us skip the first line (the header of the CSV).
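If you want to confirm that the tables were created as expected, you can describe each of them in its own block (remember, one statement per block):

%hive
DESCRIBE movies

%hive
DESCRIBE ratings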

We have various options for importing data into these tables. For example, we could use Pig, or we could create these tables as external tables pointing at the CSV files, but I’ll use Hive’s native LOAD command to import the data.
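For reference, here’s a rough sketch of that external-table alternative before we move on. The table name and HDFS location below are hypothetical, and the CSV file would first need to be copied into HDFS (for example with “hdfs dfs -put”):

%hive
-- hypothetical sketch: assumes ratings.csv has been uploaded to the HDFS directory below
CREATE EXTERNAL TABLE ratings_ext ( userId DECIMAL, movieId DECIMAL, rating DECIMAL, rating_timestamp BIGINT )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/user/zeppelin/ml-latest/ratings'
TBLPROPERTIES("skip.header.line.count"="1")

With an external table, dropping the table leaves the underlying files in place, which is handy when the data is shared with other tools.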

%hive
LOAD DATA LOCAL INPATH '/tmp/ml-latest/ratings.csv' OVERWRITE INTO TABLE ratings

%hive
LOAD DATA LOCAL INPATH '/tmp/ml-latest/movies.csv' OVERWRITE INTO TABLE movies

The above two blocks will import the data into the Hive tables. The LOCAL keyword tells Hive to read from a local directory on the server; if we want to read from an HDFS directory, we remove that keyword. The OVERWRITE keyword tells Hive to replace the existing data; without it, the new data is appended to the existing data.
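A quick way to verify the load is to count the rows in each table, again one statement per block:

%hive
SELECT COUNT(*) FROM ratings

%hive
SELECT COUNT(*) FROM movies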

By the way, Hive blocks are special blocks like Spark SQL blocks: their output is shown as a data grid, and they come with buttons to generate graphs. On the other hand, the queries we have executed so far do not return a result set, so the output won’t be fancy; all we see is an update count equal to -1. Ignore this -1 result; it’s not an error code. If you hit a real error, Hive will return a clear error message.

Now that we have imported the data into our Hive tables, we can use traditional SQL queries to fetch data from them. I tried to find the top 10 “Thriller” movies that were rated by at least 20,000 users. Here’s the query:

%hive
SELECT m.title, AVG(r.rating) avg_rating, COUNT(*) total_votes
FROM movies m
JOIN ratings r ON m.movieid = r.movieid
WHERE m.genres LIKE '%Thriller%'
GROUP BY m.movieid, m.title
HAVING COUNT(*) >= 20000
ORDER BY avg_rating DESC
LIMIT 10

Have you noticed that there is a problem with the average ratings? All of them are “4”. At first I thought it was a problem with the data type of the rating column, but it seems to be just a bug in the current Zeppelin data grid.
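If you hit the same display problem, one possible workaround (assuming the issue is purely display-side, as it was in my case) is to cast the average to a string so the grid shows it verbatim:

%hive
-- workaround sketch: the CAST ... AS STRING assumes only numeric values are mishandled by the grid
SELECT m.title, CAST(AVG(r.rating) AS STRING) avg_rating, COUNT(*) total_votes
FROM movies m
JOIN ratings r ON m.movieid = r.movieid
WHERE m.genres LIKE '%Thriller%'
GROUP BY m.movieid, m.title
HAVING COUNT(*) >= 20000
ORDER BY AVG(r.rating) DESC
LIMIT 10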

I logged in to the cloud server, switched to the hive user, and issued the “hive” command to start the Hive command line. When I ran the above query from the Hive command line, I got the correct output.
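For reference, the steps on the server look roughly like this; the SSH user and host are placeholders for your environment:

# the SSH user and host are placeholders; adjust for your environment
ssh opc@<cluster-host>
sudo su - hive
hive

At the hive> prompt, you can paste the same SELECT query.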

As you can see, Hive is a great tool for traditional SQL users who want to work with big data, and it’s ready to use with the Oracle Big Data Cloud Service – Compute Edition.

