Coronavirus has changed our lives; most of us are stuck at home, trying to follow everything about the pandemic. So I wanted to write a blog post that guides you through configuring an environment in which you can examine COVID-19 pandemic data. In this blog post, I will show you how to set up your Snowflake trial account, enable access to the COVID-19 data provided by Starschema, and use Zepl (a data science analytics platform) to analyse this data.
START YOUR SNOWFLAKE TRIAL
Let’s start with setting up a Snowflake trial account. Snowflake is a cloud data platform available on the major public cloud providers (Amazon, Azure, and Google). It provides a cloud-native data warehouse. Why am I saying “cloud-native”? Because it was not ported to the cloud like many other data warehouse services; it was designed and built to run in the cloud, so it uses the underlying cloud services efficiently. I think this is enough of an introduction to Snowflake; you will be able to discover it by yourself after we set up the trial account.
To create a trial account, visit https://trial.snowflake.com/ and fill in a simple form. It does not require a credit card or any other payment method.
Please use a browser that supports JavaScript when applying for the Snowflake trial. Sometimes reCAPTCHA is triggered and you may need to validate that you’re not a bot. The trial form asks for your name, title, email, and the country you live in. You can click “I don’t want to get any marketing emails” to avoid receiving marketing emails from Snowflake. After you enter the required information, click “create account” to go to the second part of the form, where we configure your account. I will choose “Enterprise Edition” as the Snowflake edition, and I recommend you choose it as well so you can test all features. I chose the Oregon region of Amazon Web Services as the deployment region; of course, you can choose a different region based on your existing cloud applications. To complete the process, enter your phone number, check the box indicating that you have read and agree to the terms, and click “finalize setup” to apply for the trial account.
You will be forwarded to a new page saying that you will receive an activation email within 15 minutes. Under normal conditions, the activation mail arrives in a couple of minutes. Please make sure you check your spam/junk folders. If you do not receive the email, contact support@snowflake.com!
When you receive the activation email, click the “activate your account” button. It will open a new page for activating your account. Enter a username and a password to create your first (admin) user. After creating this user, you will be able to log in to the Snowflake web console. The account URL is written in your activation email; each Snowflake account has its own unique URL, so I recommend bookmarking yours for easy access. For more information about the Snowflake URL format, please visit the Snowflake documentation.
ACCESS TO THE COVID-19 DATA
Snowflake provides a modern web interface to manage your account and run SQL queries. Before exploring the web interface, let’s enable the COVID-19 data provided by Starschema. Starschema is a Snowflake partner, and they use Snowflake’s secure data sharing feature to share the COVID-19 data. This is one of the unique features of Snowflake: it lets customers share “live” data instantly and securely. Changes made in real time by a data provider are immediately available to data consumers, so as a consumer accessing shared data, you do not need to worry about keeping it up to date.
To enable access to the COVID-19 data, you need to fill in the “data request” form. When filling in the form, do not forget to enter your Snowflake URL. I have to say that approving your access is handled by Starschema, and it’s a manual process for now, so it may take one business day to get access to the data.
You will receive an email informing you that the COVID-19 data is enabled for your account. Log in to the Snowflake web interface. You will see your username (created during activation) and sysadmin (the default role of your user). Although the COVID-19 data is shared with us, we still need to create a database from the share. Data sharing activities should be handled by the account admin, so click the down arrow next to your username, select “switch role”, and then select “accountadmin” as your active role. For more information about switching roles, please visit the Snowflake documentation.
When you switch to the accountadmin role, the “shares” button on the main menu becomes available. Click the “shares” menu, and you will see “COVID19_BY_STARSCHEMA” in the secure shares list. Select it and click the “create database from secure share” button (the SQL equivalent is shown below).
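If you prefer SQL over the web interface, the same steps can be done from a worksheet. This is a sketch: the placeholder <provider_account> stands for the provider account name shown next to the share in the secure shares list.

-- Switch to the account admin role; shares are managed at the account level.
use role accountadmin;

-- List the inbound shares available to your account.
show shares;

-- Create a read-only database on top of the Starschema share.
-- Replace <provider_account> with the provider account shown by "show shares".
create database covid19 from share <provider_account>.covid19_by_starschema;

-- Allow every user in the account to query the shared data.
grant imported privileges on database covid19 to role public;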
We need to enter a name for our new database and select a role to grant access to. I selected the public role so any user in my account can access this data. When you click “create database”, a new database is created using the shared data. As I already mentioned, whenever the data is updated, you will see the changes instantly. I go back to the worksheets to run my first SQL. Before running it, I need to select a warehouse, a database, a schema, and a role. I can set them using the context menu, but this time I will use SQL commands. I run the following commands (using command+enter or the run button above the editor):
use role accountadmin;
use warehouse compute_wh;
use database covid19;
use schema public;
Please note that the role used by the worksheet is different from the role we set in the main menu (at the top right). If you examine these commands, you may wonder why we are choosing a warehouse. When working with Snowflake, you use virtual warehouses to process your data. These virtual warehouses can be suspended or resized at any time to accommodate the need for more or less compute resources. I will not go into the details here, but I highly recommend reading more in the Snowflake documentation.
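As a small illustration of that flexibility, here is a sketch of how the trial’s default warehouse (compute_wh) can be resized and suspended from SQL:

-- Scale the warehouse up for a heavy query, then back down.
alter warehouse compute_wh set warehouse_size = 'MEDIUM';
alter warehouse compute_wh set warehouse_size = 'X-SMALL';

-- Suspend it manually so it stops consuming credits
-- (it also auto-suspends after a period of inactivity).
alter warehouse compute_wh suspend;

-- It resumes automatically on the next query, or explicitly:
alter warehouse compute_wh resume;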
Here is a summary of the tables in the shared COVID-19 database:
| Table Name | Description |
| --- | --- |
| CT_US_COVID_TESTS | US COVID-19 testing and mortality |
| DEMOGRAPHICS | Demographic data (United States), 2019 |
| HDX_ACAPS | ACAPS data on international public health measures |
| HS_BULK_DATA | Global data on healthcare providers |
| HUM_RESTRICTIONS_AIRLINE | COVID-19 travel restrictions by airline |
| HUM_RESTRICTIONS_COUNTRY | COVID-19 travel restrictions by country |
| JHU_COVID_19 | Global case counts |
| KFF_HCP_CAPACITY | US healthcare capacity by state, 2018 |
| KFF_US_ICU_BEDS | ICU beds, US |
| KFF_US_POLICY_ACTIONS | US policy actions by state |
| KFF_US_STATE_MITIGATIONS | US actions to mitigate spread, by state |
| NYT_US_COVID19 | COVID-19 cases and deaths, US, county level |
| PCM_DPS_COVID19 | Italy case statistics, summary |
| SCS_BE_DETAILED_HOSPITALISATIONS | Hospitalisations and details on patient disposition |
| SCS_BE_DETAILED_MORTALITY | Detailed data on mortality |
| SCS_BE_DETAILED_PROVINCE_CASE_COUNTS | Detailed data on case counts in Belgium and Luxembourg |
| SCS_BE_DETAILED_TESTS | Day-to-day time series on tests performed |
| VH_CAN_DETAILED | COVID-19 cases and deaths, Canada, province level |
| WHO_SITUATION_REPORTS | WHO situation reports |
Using the “show tables” and “describe table” commands, or the object browser, you can get more information about the tables. I will use the “jhu_covid_19” table to get the number of confirmed cases in Italy. Here’s the query:
select date, cases
from jhu_covid_19
where country_region = 'Italy'
  and case_type = 'Confirmed'
  and cases > 0
order by date;
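If you want to check which columns are available before writing queries like this, a quick way (a sketch; run it in the same worksheet context we set up above) is:

-- List the tables that come with the share.
show tables in database covid19;

-- Show the columns of the table used above
-- (country_region, case_type, cases, date, ...).
describe table jhu_covid_19;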
Let’s compare Italy to the Netherlands. I chose the Netherlands because I live there. When I checked the case numbers in Italy, I saw that the number of cases was very low in the first 20 days. It might be related to the number of tests performed, or a lack of test kits. So I will start from the point where there are at least 100 confirmed cases. This will be my query:
with Italy as (
  select row_number() over (order by date) dayno, cases, difference
  from jhu_covid_19
  where country_region = 'Italy' and case_type = 'Confirmed' and cases >= 100),
Netherlands as (
  select row_number() over (order by date) dayno, cases, difference
  from jhu_covid_19
  where country_region = 'Netherlands' and case_type = 'Confirmed' and cases >= 100)
select Netherlands.dayno, Netherlands.cases as Netherlands, Netherlands.difference as NLdelta,
       Italy.cases as Italy, Italy.difference as ITdelta
from Netherlands
join Italy on Netherlands.dayno = Italy.dayno
order by Netherlands.dayno;
As you can see, I use the row_number function to generate a relative day number (days since the 100th case), and list the confirmed cases and the difference (how many new cases were detected since the previous day). Here is the result:
Here is another example. This time I will check the mortality rate and confirmed cases (again for the Netherlands). The jhu_covid_19 table has 4 rows per day for each location (one per case type). Again, I will ignore the first days of the pandemic and start from the 100th case.
select row_number() over (order by date) dayno,
       sum(iff(case_type = 'Confirmed', cases, 0)) confirmed_cases,
       sum(iff(case_type = 'Deaths', cases, 0)) deaths,
       round(deaths / confirmed_cases * 100, 2) mortality_rate
from jhu_covid_19
where country_region = 'Netherlands'
  and date > (select min(date) from jhu_covid_19
              where case_type = 'Confirmed' and cases >= 100
                and country_region = 'Netherlands')
group by date;
Here is the result of the above query:
You may run this query for other countries, and you will see that the mortality rate increases with the number of confirmed cases. The mortality rate was under 1% when the Netherlands had 188 cases, and now it is 9.42%. This is probably related to the capacity of the health system.
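To rerun the query for another country without editing it in two places, you can use a Snowflake session variable. A small sketch (the variable name "country" is my own choice):

-- Set the country once; it is referenced as $country below.
set country = 'Italy';

select row_number() over (order by date) dayno,
       sum(iff(case_type = 'Confirmed', cases, 0)) confirmed_cases,
       sum(iff(case_type = 'Deaths', cases, 0)) deaths,
       round(deaths / confirmed_cases * 100, 2) mortality_rate
from jhu_covid_19
where country_region = $country
  and date > (select min(date) from jhu_covid_19
              where case_type = 'Confirmed' and cases >= 100
                and country_region = $country)
group by date;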
ENABLE ZEPL TRIAL ACCOUNT
Although the Snowflake web interface is modern and user friendly, it cannot generate charts from query results (yet). So I go to the “partner connect” tab and activate a “Zepl trial account”.
Zepl is one of Snowflake’s partners, and we will use it to complete our environment. Zepl was founded by the creators of Apache Zeppelin, and it provides Zeppelin notebooks preloaded with the drivers needed to connect to Snowflake.
When activating the Zepl account, you will be asked to confirm that Zepl may create a database (PC_ZEPL_DB), a warehouse (PC_ZEPL_WH), a user (PC_ZEPL_USER), and a role (PC_ZEPL_ROLE). This allows Zepl to create a sample connection to your account. If you want, you can also visit https://www.zepl.com and create a trial account manually. In fact, even when enabling Zepl through Snowflake, you need to fill in a registration form to activate the account. After completing the registration process, you will receive a welcome email. You can log in to your account through the partner connect page or by visiting the zepl.com website.
Zepl enables you to run Python, Spark, and SQL on supported data sources such as Snowflake. If you are familiar with Zeppelin notebooks, you will master the interface in a couple of minutes. I click the “New” button on the left side and select “New Notebook”. I enter a name for my notebook (covid19sql), accept all default values, and click “create” to create my notebook.
On the right side of the notebook, you can add, remove, and list data sources. Because I activated the Zepl account through Snowflake, a connection to my account is already defined. I click the “plus” button in front of the connection name to add it to my notebook. If you don’t see any connection, you can click “add new” and define a connection to your Snowflake account.
To query my data source, I need to enter “%datasource.DATASOURCENAME” as the interpreter. The predefined connection uses PC_ZEPL_DB, so I set my active database to covid19 and then run a query similar to the earlier one to compare the number of cases in the Netherlands and Italy (see the sketch below). I removed the difference column, as we can already see the increase on the chart. I choose the “line” chart, and set Netherlands and Italy as the Y-Axis and dayno as the X-Axis.
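Here is a sketch of what that notebook paragraph might look like. The data source name (snowflake_datasource) is a placeholder; use the name shown in your own data sources panel. I use fully qualified table names so the query works regardless of the connection’s default database (PC_ZEPL_DB).

%datasource.snowflake_datasource
with Italy as (
  select row_number() over (order by date) dayno, cases
  from covid19.public.jhu_covid_19
  where country_region = 'Italy' and case_type = 'Confirmed' and cases >= 100),
Netherlands as (
  select row_number() over (order by date) dayno, cases
  from covid19.public.jhu_covid_19
  where country_region = 'Netherlands' and case_type = 'Confirmed' and cases >= 100)
select Netherlands.dayno, Netherlands.cases as Netherlands, Italy.cases as Italy
from Netherlands
join Italy on Netherlands.dayno = Italy.dayno
order by Netherlands.dayno;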
Zepl offers much more than adding charts to your results. You can use Spark or Python to process your data. Here’s a simple script to demonstrate some capabilities of Zepl:
%python
import datetime

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Read the confirmed case numbers for the Netherlands from the shared database.
conn = z.getDatasource("snowflake_GOA51354_2f228e")
cases = conn.execute("select date, cases from covid19.public.jhu_covid_19 "
                     "where country_region = 'Netherlands' "
                     "and case_type = 'Confirmed' and cases > 100 order by date")
cases_df = pd.DataFrame(cases)
cases_df.columns = ['date', 'cases']

# Fit an exponential smoothing model with a multiplicative trend.
model = ExponentialSmoothing(cases_df['cases'], trend='mul')
model_fit = model.fit()

# Predict the next 30 days and attach the corresponding dates.
result = model_fit.predict(len(cases_df['cases']),
                           len(cases_df['cases']) + 30).to_frame().astype(int)
result.columns = ['cases']
result['date'] = pd.date_range(start=cases_df['date'].iloc[-1] + datetime.timedelta(days=1),
                               periods=len(result), freq='D').date

# Combine the actual and predicted values, and chart them.
total = pd.concat([cases_df, result])
z.show(total)
The “z” is a predefined Zeppelin context object. You can use it to access data sources (such as Snowflake), draw charts, and reach other Zeppelin functions. Pandas and other statistics libraries are already installed, and it’s also possible to install additional Python libraries using pip. My script reads the confirmed case numbers from the Snowflake covid19 database and tries to predict the number of cases over the next 30 days using an exponential smoothing model. “z.show” displays a chart based on the values of a pandas dataframe. I set the “date” column as the X-Axis and the “cases” column as the Y-Axis, and chose the original Zeppelin “line” chart this time.
Our prediction shows that we will have more than 95,000 confirmed COVID-19 cases in the Netherlands by 05-05-2020. Of course, this is a very simple prediction that ignores all the measures taken by the government; I’m sure you can write a better one. I wrote it only to demonstrate how you can analyze the COVID-19 data hosted on Snowflake using Zepl (Python). If you have any questions, please let me know. If I receive questions, I will try to write more blog posts about Snowflake.
Stay home and stay safe!