Coronavirus has changed our lives; most of us are stuck at home, trying to follow everything about the pandemic. So I wanted to write a blog post that guides you through configuring an environment in which you can examine COVID-19 pandemic data. In this blog post, I will show you how to set up your Snowflake trial account, enable access to the COVID-19 data provided by Starschema, and use Zepl (a data science analytics platform) to analyse this data.
START YOUR SNOWFLAKE TRIAL
Let’s start with setting up a Snowflake trial account. Snowflake is a cloud data platform available on the major public cloud providers (Amazon, Azure, and Google). It provides a cloud-native data warehouse. Why am I saying “cloud-native”? Because it was not ported to the cloud like many other data warehouse services; it was designed and built to run in the cloud, so it uses the underlying cloud services efficiently. I think this is enough of an introduction to Snowflake; you will be able to discover it by yourself after we set up the trial account.
To create a trial account, visit https://trial.snowflake.com/ and fill in a simple form. It does not require a credit card or any other payment method.
Please use a browser that supports JavaScript when applying for the Snowflake trial. Sometimes reCAPTCHA is triggered and you may need to validate that you’re not a bot. The trial form asks for your name, title, email, and the country you live in. You can click “I don’t want to get any marketing emails” to avoid receiving marketing emails from Snowflake. After you enter the required information, click “create account” to go to the second part of the form, where we configure your account. I will choose “Enterprise Edition” as the Snowflake edition, and I recommend you choose it as well so you can test all features. I chose the Oregon region of Amazon Web Services as the deployment region; of course, you can choose a different region based on your existing cloud applications. To complete the process, enter your phone number, check the box indicating that you have read and agree to the terms, and click “finalize setup” to apply for the trial account.
You will be forwarded to a new page saying that you will receive an activation email within 15 minutes. Under normal conditions, the activation mail arrives in a couple of minutes. Please make sure you check your spam/junk folders. If you do not receive the email, contact support@snowflake.com!
When you receive the activation email, click the “activate your account” button. It will open a new page for activating your account. Enter a username and a password to create your first (admin) user. After creating this user, you will be able to log in to the Snowflake web console. The account URL is written in your activation email; each Snowflake account has its own unique URL, so I recommend bookmarking yours for easy access. For more information about the Snowflake URL format, please visit the Snowflake documentation.
ACCESS TO THE COVID-19 DATA
Snowflake provides a modern web interface to manage your account and run SQL queries. Before exploring the web interface, let’s enable the COVID-19 data provided by Starschema. Starschema is a Snowflake partner, and they use Snowflake’s secure data sharing feature to share the COVID-19 data. This is one of the unique features of Snowflake: it lets customers share “live” data instantly and securely. Changes made in real time by a data provider are immediately available to data consumers, so as a consumer accessing shared data, you do not need to worry about keeping it up to date.
To enable access to the COVID-19 data, you need to fill in the “data request” form. When filling in the form, do not forget to enter your Snowflake URL. I have to say that approving your access is handled by Starschema, and it’s a manual process for now, so it may take one business day to get access to the data.
You will receive an email informing you that the COVID-19 data is enabled for your account. Log in to the Snowflake web interface. You will see your username (created during activation) and sysadmin (the default role of your user). Although the COVID-19 data is shared with us, we still need to create a database from the share. Data sharing activities should be handled by the account admin, so click the down arrow next to your username, select “switch role”, and then select “accountadmin” as your active role. For more information about switching roles, please visit the Snowflake documentation.
When you switch to the accountadmin role, the “shares” button on the main menu becomes available. Click the “shares” menu, and you will see “COVID19_BY_STARSCHEMA” in the secure shares list. Select it and click the “create database from secure share” button (the SQL equivalent is shown below).
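If you prefer SQL over the web interface, the same steps can be done from a worksheet. This is a sketch: the placeholder <provider_account> stands for the provider account name shown next to the share in the secure shares list.

-- Switch to the account admin role; shares are managed at the account level.
use role accountadmin;

-- List the inbound shares available to your account.
show shares;

-- Create a read-only database on top of the Starschema share.
-- Replace <provider_account> with the provider account shown by "show shares".
create database covid19 from share <provider_account>.covid19_by_starschema;

-- Allow every user in the account to query the shared data.
grant imported privileges on database covid19 to role public;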
We need to enter a name for our new database and select a role to grant access to. I selected the public role so any user in my account can access this data. When you click “create database”, a new database is created using the shared data. As I already mentioned, whenever the data is updated, you will see the changes instantly. I go back to the worksheets to run my first SQL. Before running it, I need to select a warehouse, a database, a schema, and a role. I can set them using the context menu, but this time I will use SQL commands. I run the following commands (using command+enter or the run button above the editor):
use role accountadmin;
use warehouse compute_wh;
use database covid19;
use schema public;
Please note that the role used by the worksheet is different from the role we set in the main menu (at the top right). If you examine these commands, you may wonder why we are choosing a warehouse. When working with Snowflake, you use virtual warehouses to process your data. These virtual warehouses can be suspended or resized at any time to accommodate the need for more or less compute resources. I will not go into the details here, but I highly recommend reading more in the Snowflake documentation.
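As a small illustration of that flexibility, here is a sketch of how the trial’s default warehouse (compute_wh) can be resized and suspended from SQL:

-- Scale the warehouse up for a heavy query, then back down.
alter warehouse compute_wh set warehouse_size = 'MEDIUM';
alter warehouse compute_wh set warehouse_size = 'X-SMALL';

-- Suspend it manually so it stops consuming credits
-- (it also auto-suspends after a period of inactivity).
alter warehouse compute_wh suspend;

-- It resumes automatically on the next query, or explicitly:
alter warehouse compute_wh resume;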
Here is a summary of the tables in the shared COVID-19 database:
| Table Name | Description |
| --- | --- |
| CT_US_COVID_TESTS | US COVID-19 testing and mortality |
| DEMOGRAPHICS | Demographic data (United States), 2019 |
| HDX_ACAPS | ACAPS data on international public health measures |
| HS_BULK_DATA | Global data on healthcare providers |
| HUM_RESTRICTIONS_AIRLINE | COVID-19 travel restrictions by airline |
| HUM_RESTRICTIONS_COUNTRY | COVID-19 travel restrictions by country |
| JHU_COVID_19 | Global case counts |
| KFF_HCP_CAPACITY | US healthcare capacity by state, 2018 |
| KFF_US_ICU_BEDS | ICU beds, US |
| KFF_US_POLICY_ACTIONS | US policy actions by state |
| KFF_US_STATE_MITIGATIONS | US actions to mitigate spread, by state |
| NYT_US_COVID19 | COVID-19 cases and deaths, US, county level |
| PCM_DPS_COVID19 | Italy case statistics, summary |
| SCS_BE_DETAILED_HOSPITALISATIONS | Hospitalisations and details on patient disposition |
| SCS_BE_DETAILED_MORTALITY | Detailed data on mortality |
| SCS_BE_DETAILED_PROVINCE_CASE_COUNTS | Detailed data on case counts in Belgium and Luxembourg |
| SCS_BE_DETAILED_TESTS | Day-to-day time series on tests performed |
| VH_CAN_DETAILED | COVID-19 cases and deaths, Canada, province level |
| WHO_SITUATION_REPORTS | WHO situation reports |
Using the “show tables” and “describe table” commands, or the object browser, you can get more information about the tables. I will use the “jhu_covid_19” table to get the number of confirmed cases in Italy. Here’s the query:
select date, cases
from jhu_covid_19
where country_region = 'Italy'
  and case_type = 'Confirmed'
  and cases > 0
order by date;
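If you want to check which columns are available before writing queries like this, a quick way (a sketch; run it in the same worksheet context we set up above) is:

-- List the tables that come with the share.
show tables in database covid19;

-- Show the columns of the table used above
-- (country_region, case_type, cases, date, ...).
describe table jhu_covid_19;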
Let’s compare Italy to the Netherlands. I chose the Netherlands because I live there. When I checked the case numbers in Italy, I saw that the number of cases was very low in the first 20 days. It might be related to the number of tests performed, or a lack of test kits. So I will start from the point where there are at least 100 confirmed cases. This will be my query:
with Italy as (
  select row_number() over (order by date) dayno, cases, difference
  from jhu_covid_19
  where country_region = 'Italy' and case_type = 'Confirmed' and cases >= 100),
Netherlands as (
  select row_number() over (order by date) dayno, cases, difference
  from jhu_covid_19
  where country_region = 'Netherlands' and case_type = 'Confirmed' and cases >= 100)
select Netherlands.dayno, Netherlands.cases as Netherlands, Netherlands.difference as NLdelta,
       Italy.cases as Italy, Italy.difference as ITdelta
from Netherlands
join Italy on Netherlands.dayno = Italy.dayno
order by Netherlands.dayno;
As you can see, I use the row_number function to generate a relative day number (days since the 100th case), and list the confirmed cases and the difference (how many new cases were detected since the previous day). Here is the result:
Here is another example. This time I will check the mortality rate and confirmed cases (again for the Netherlands). The jhu_covid_19 table has 4 rows per day for each location (one per case type). Again, I will ignore the first days of the pandemic and start from the 100th case.
select row_number() over (order by date) dayno,
       sum(iff(case_type = 'Confirmed', cases, 0)) confirmed_cases,
       sum(iff(case_type = 'Deaths', cases, 0)) deaths,
       round(deaths / confirmed_cases * 100, 2) mortality_rate
from jhu_covid_19
where country_region = 'Netherlands'
  and date > (select min(date) from jhu_covid_19
              where case_type = 'Confirmed' and cases >= 100
                and country_region = 'Netherlands')
group by date;
Here is the result of the above query:
You may run this query for other countries, and you will see that the mortality rate increases with the number of confirmed cases. The mortality rate was under 1% when the Netherlands had 188 cases, and now it is 9.42%. This is probably related to the capacity of the health system.
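To rerun the query for another country without editing it in two places, you can use a Snowflake session variable. A small sketch (the variable name "country" is my own choice):

-- Set the country once; it is referenced as $country below.
set country = 'Italy';

select row_number() over (order by date) dayno,
       sum(iff(case_type = 'Confirmed', cases, 0)) confirmed_cases,
       sum(iff(case_type = 'Deaths', cases, 0)) deaths,
       round(deaths / confirmed_cases * 100, 2) mortality_rate
from jhu_covid_19
where country_region = $country
  and date > (select min(date) from jhu_covid_19
              where case_type = 'Confirmed' and cases >= 100
                and country_region = $country)
group by date;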
ENABLE ZEPL TRIAL ACCOUNT
Although the Snowflake web interface is modern and user friendly, it cannot generate charts from query results (yet). So I go to the “partner connect” tab and activate a “Zepl trial account”.
Zepl is one of Snowflake’s partners, and we will use it to complete our environment. Zepl was founded by the creators of Apache Zeppelin, and it provides Zeppelin notebooks preloaded with the drivers needed to connect to Snowflake.
When activating the Zepl account, you will be asked to confirm that Zepl may create a database (PC_ZEPL_DB), a warehouse (PC_ZEPL_WH), a user (PC_ZEPL_USER), and a role (PC_ZEPL_ROLE). This allows Zepl to create a sample connection to your account. If you want, you can also visit https://www.zepl.com and create a trial account manually. In fact, even when enabling Zepl through Snowflake, you need to fill in a registration form to activate the account. After completing the registration process, you will receive a welcome email. You can log in to your account through the partner connect page or by visiting the zepl.com website.
Zepl enables you to run Python, Spark, and SQL on supported data sources such as Snowflake. If you are familiar with Zeppelin notebooks, you will master the interface in a couple of minutes. I click the “New” button on the left side and select “New Notebook”. I enter a name for my notebook (covid19sql), accept all default values, and click “create” to create my notebook.
On the right side of the notebook, you can add, remove, and list data sources. Because I activated the Zepl account through Snowflake, a connection to my account is already defined. I click the “plus” button in front of the connection name to add it to my notebook. If you don’t see any connection, you can click “add new” and define a connection to your Snowflake account.
To query my data source, I need to enter “%datasource.DATASOURCENAME” as the interpreter. The predefined connection uses PC_ZEPL_DB, so I set my active database to covid19 and then run a query similar to the earlier one to compare the number of cases in the Netherlands and Italy (see the sketch below). I removed the difference column, as we can already see the increase on the chart. I choose the “line” chart, and set Netherlands and Italy as the Y-Axis and dayno as the X-Axis.
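Here is a sketch of what that notebook paragraph might look like. The data source name (snowflake_datasource) is a placeholder; use the name shown in your own data sources panel. I use fully qualified table names so the query works regardless of the connection’s default database (PC_ZEPL_DB).

%datasource.snowflake_datasource
with Italy as (
  select row_number() over (order by date) dayno, cases
  from covid19.public.jhu_covid_19
  where country_region = 'Italy' and case_type = 'Confirmed' and cases >= 100),
Netherlands as (
  select row_number() over (order by date) dayno, cases
  from covid19.public.jhu_covid_19
  where country_region = 'Netherlands' and case_type = 'Confirmed' and cases >= 100)
select Netherlands.dayno, Netherlands.cases as Netherlands, Italy.cases as Italy
from Netherlands
join Italy on Netherlands.dayno = Italy.dayno
order by Netherlands.dayno;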
Zepl offers much more than adding charts to your results. You can use Spark or Python to process your data. Here’s a simple script to demonstrate some capabilities of Zepl:
%python
import datetime

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Read the confirmed case numbers for the Netherlands from the shared database.
conn = z.getDatasource("snowflake_GOA51354_2f228e")
cases = conn.execute("select date, cases from covid19.public.jhu_covid_19 "
                     "where country_region = 'Netherlands' "
                     "and case_type = 'Confirmed' and cases > 100 order by date")
cases_df = pd.DataFrame(cases)
cases_df.columns = ['date', 'cases']

# Fit an exponential smoothing model with a multiplicative trend.
model = ExponentialSmoothing(cases_df['cases'], trend='mul')
model_fit = model.fit()

# Predict the next 30 days and attach the corresponding dates.
result = model_fit.predict(len(cases_df['cases']),
                           len(cases_df['cases']) + 30).to_frame().astype(int)
result.columns = ['cases']
result['date'] = pd.date_range(start=cases_df['date'].iloc[-1] + datetime.timedelta(days=1),
                               periods=len(result), freq='D').date

# Combine the actual and predicted values, and chart them.
total = pd.concat([cases_df, result])
z.show(total)
The “z” is a predefined Zeppelin context object. You can use it to access data sources (such as Snowflake), draw charts, and reach other Zeppelin functions. Pandas and other statistics libraries are already installed, and it’s also possible to install additional Python libraries using pip. My script reads the confirmed case numbers from the Snowflake covid19 database and tries to predict the number of cases over the next 30 days using an exponential smoothing model. “z.show” displays a chart based on the values of a pandas dataframe. I set the “date” column as the X-Axis and the “cases” column as the Y-Axis, and chose the original Zeppelin “line” chart this time.
Our prediction shows that we will have more than 95,000 confirmed COVID-19 cases in the Netherlands by 05-05-2020. Of course, this is a very simple prediction that ignores all the measures taken by the government; I’m sure you can write a better one. I wrote it only to demonstrate how you can analyze the COVID-19 data hosted on Snowflake using Zepl (Python). If you have any questions, please let me know. If I receive questions, I will try to write more blog posts about Snowflake.
Stay home and stay safe!