I thought I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I hadn’t mentioned Apache Hive, another important component of the big data ecosystem. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), and helps users apply their existing SQL skills to quickly derive value from big data.
Although recent improvements to the Hive project enable sub-second query retrieval (Hive LLAP), Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.
In this blog post, I’ll demonstrate how we can import data from CSV files into Hive tables, and run SQL queries to analyze the data stored in these tables.
We can use a third-party Hive client (such as Zeppelin, HUE, or DBeaver), the Ambari Hive view, or the Hive command line to connect to the Hive server. I’ll use a Zeppelin notebook, although it has some limitations when working with Hive: each block can run only one Hive SQL statement (you cannot run several queries separated by semicolons), and in one notebook you can run up to 10 queries.
If you have already read my previous blog posts about the Oracle Big Data Cloud Service – Compute Edition, you probably remember that I used a dataset about flights. This time, I’ll use a different dataset. I found the MovieLens dataset (from a movie recommendation service) on the net, and it’s much better than the one I used in my previous posts. It contains movie information and ratings/votes for each movie. It’s very clean and simple to understand.
First, I create a new Zeppelin notebook named Hive (its name doesn’t matter), and in the first code block, I write a simple bash script to download and unzip the MovieLens dataset. Here’s the code:
%sh
cd /tmp/
rm ml-latest.zip
rm -rf /tmp/ml-latest
wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip ml-latest.zip
In case you missed my blog post about Zeppelin, the first statement (%sh) indicates that this is a shell block, so Zeppelin uses the shell interpreter to run the content of the block. When we run the above block, ml-latest.zip will be downloaded and unzipped into the “/tmp/ml-latest” directory. We’ll only use “movies.csv” and “ratings.csv”.
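If you want to verify the download before moving on (this is just an optional check, not part of the original flow), you can list the extracted files in another shell block:

%sh
ls -l /tmp/ml-latest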
%hive DROP TABLE IF EXISTS ratings
%hive DROP TABLE IF EXISTS movies
We need to create the movies and ratings tables in Hive, but first we need to drop them if they already exist. You can skip this step, but it’s useful to include the above blocks so our notebook can be re-run without any errors.
%hive
CREATE TABLE ratings (
  userId DECIMAL,
  movieId DECIMAL,
  rating DECIMAL,
  rating_timestamp BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1")
%hive
CREATE TABLE movies (
  movieId DECIMAL,
  title STRING,
  genres STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1")
The above blocks will create the Hive tables to hold the movies and ratings data. As you can see, the ratings table has 4 columns (userId, movieId, rating, rating_timestamp) and the movies table has 3 columns (movieId, title, genres). “skip.header.line.count” lets us skip the first line (the header line of the CSV).
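If you want to double-check a table definition, including the “skip.header.line.count” property, you can describe the table in a separate block (an optional verification step):

%hive
DESCRIBE FORMATTED ratings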
We have various options for importing data into these tables. For example, we can use Pig to import the data. We could also create these tables as “external tables” pointing to the CSV files, but I’ll use Hive’s native LOAD command to import the data.
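For reference, here’s a minimal sketch of the external-table alternative, assuming the CSV files were uploaded to a hypothetical HDFS directory such as /user/zeppelin/movielens/ratings (the directory name and table name are just examples):

%hive
CREATE EXTERNAL TABLE ratings_ext (
  userId DECIMAL,
  movieId DECIMAL,
  rating DECIMAL,
  rating_timestamp BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/user/zeppelin/movielens/ratings'
TBLPROPERTIES ("skip.header.line.count"="1")

With an external table, Hive reads the CSV files in place, and dropping the table does not delete the underlying data.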
%hive LOAD DATA LOCAL INPATH '/tmp/ml-latest/ratings.csv' OVERWRITE INTO TABLE ratings
%hive LOAD DATA LOCAL INPATH '/tmp/ml-latest/movies.csv' OVERWRITE INTO TABLE movies
The above two blocks import the data into the Hive tables. The LOCAL keyword tells Hive that we are reading from a local directory on the server; if we want to read from an HDFS directory, we remove it. The OVERWRITE keyword tells Hive to replace the existing data; if we don’t use it, the new data is appended to the existing data.
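For example, if the ratings file had already been copied to an HDFS directory (say, a hypothetical /user/zeppelin/ml-latest), the load would look like this; without LOCAL, Hive reads (and moves) the file from HDFS, and without OVERWRITE it appends to the existing data:

%hive
LOAD DATA INPATH '/user/zeppelin/ml-latest/ratings.csv' INTO TABLE ratings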
By the way, Hive blocks are special blocks, like Spark SQL blocks: their output is shown as a data grid, and they come with buttons to generate graphs. However, the statements we just executed do not return a result set, so the output won’t be fancy; all we see is an update count of -1. Ignore this -1 result, it’s not an error code. If you hit a real error, Hive will return a clear error message.
Now that we have imported the data into our Hive tables, we can use traditional SQL queries to fetch data from them. I tried to find the top 10 “Thriller” movies that were rated by at least 20,000 users. Here’s the query:
%hive
SELECT m.title, AVG(r.rating) avg_rating, COUNT(*) total_votes
FROM movies AS m, ratings AS r
WHERE m.movieid = r.movieid AND m.genres LIKE '%Thriller%'
GROUP BY m.movieid, m.title
HAVING COUNT(*) >= 20000
ORDER BY avg_rating DESC
LIMIT 10
Have you noticed that there is a problem with the average ratings? All of them are shown as “4”. At first I thought it was a problem with the data type of the rating column, but it seems to be just a bug in the current Zeppelin data grid.
I logged in to the cloud server, switched to the hive user, and issued the “hive” command to start the Hive command line. When I ran the above query from the Hive command line, I got the correct output.
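Roughly, the steps look like this (user names and access details may differ on your cluster):

# connect to the cloud server over SSH, then:
sudo su - hive     # switch to the hive user
hive               # start the Hive command line
# at the hive> prompt, paste the same SELECT ... LIMIT 10 query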
As you can see, Hive is a great tool for traditional SQL users who want to work with big data, and it’s ready to use with the Oracle Big Data Cloud Service – Compute Edition.