Spark applications may depend on third-party Java or Scala packages stored in a Maven repository, and these packages can be included with the "--packages" parameter when submitting a Spark job. For example, if we want to include the PostgreSQL JDBC driver in our Spark job when submitting from the command line, we can use the following syntax:
spark-submit --packages org.postgresql:postgresql:9.4.1211
We can also use the same parameter when launching an interactive PySpark session:
pyspark --packages org.postgresql:postgresql:9.4.1211
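Once the package is on the classpath, the JDBC driver it provides can be used from Spark. Here is a minimal PySpark sketch of that; the hostname, database, table, and credentials are placeholders, not values from this post:

from pyspark.sql import SparkSession

# In the pyspark shell a SparkSession already exists; getOrCreate() reuses it.
spark = SparkSession.builder.appName("postgres-example").getOrCreate()

# The --packages JAR supplies the org.postgresql.Driver class used below.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://your-db-host:5432/your_database")  # placeholder host/db
      .option("dbtable", "public.your_table")                              # placeholder table
      .option("user", "your_user")                                         # placeholder credentials
      .option("password", "your_password")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(5)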
JupyterHub, however, does not let you use this parameter. On an EMR cluster, JupyterHub is installed as a Docker image and uses Livy to submit Spark jobs. When you connect to a JupyterHub notebook and run any Spark script, the Spark context is created automatically, so at that point it is not possible to modify the Spark configuration to include extra packages. On the other hand, JupyterHub reads /etc/jupyter/conf/config.json when creating a Spark context, so if we modify this file, we can include any Maven package in our Spark job.
Here are the steps to include a package from a Maven repository:
1- Connect to your EMR master node and run the following command to copy the "config.json" file out of the JupyterHub Docker container:
sudo docker cp jupyterhub:/etc/jupyter/conf/config.json .
2- Edit the file using "vi" or a similar editor, and add the required packages (as a comma-delimited list) to the "session_configs" section. For example, I entered the postgresql package:
"conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}
To help you understand the required modification, I’m sharing my original “config.json” file:
{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"}, "logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}}, "handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter", "home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}}, "wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{ "driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, "pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0, "heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy": "configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}
Here is the modified one:
{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"}, "logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}}, "handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter", "home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}}, "wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{ "conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}, "driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, "pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0, "heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy": "configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}
3- After you make the modification, copy the file back into the JupyterHub Docker container:
sudo docker cp config.json jupyterhub:/etc/jupyter/conf/config.json
4- Restart the kernel in your Jupyter notebook, and you should then be able to use the package.
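One quick way to confirm the package was picked up (assuming the new session uses the modified "session_configs") is to print the relevant Spark property from a notebook cell:

# Should print "org.postgresql:postgresql:9.4.1211" if the config was applied
print(spark.sparkContext.getConf().get("spark.jars.packages"))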
Hope it helps.