
How to include 3rd party (Maven) dependencies in Spark jobs submitted from Jupyterhub on EMR


Spark applications may depend on third-party Java or Scala packages stored in a Maven repository, and these packages can be included with the “--packages” parameter when submitting a Spark job. For example, if we want to include the PostgreSQL JDBC driver package in our Spark job when submitting from the command line, we can use the following syntax:

spark-submit --packages org.postgresql:postgresql:9.4.1211
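
In a real submission, the “--packages” parameter is combined with the application you want to run. A minimal sketch, assuming a hypothetical script named my_job.py:

spark-submit --packages org.postgresql:postgresql:9.4.1211 my_job.py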

We can also use the same parameter when launching an interactive PySpark session:

pyspark --packages org.postgresql:postgresql:9.4.1211

Jupyterhub does not let you use this parameter. On EMR, Jupyterhub is installed as a Docker container, and it uses Livy to submit Spark jobs. When you connect to a Jupyterhub notebook and run any Spark script, the Spark context is created automatically, so there is no opportunity to modify the Spark configuration to include extra packages. On the other hand, Jupyterhub reads /etc/jupyter/conf/config.json when creating a Spark context. If we modify this file, we can include any Maven package in our Spark jobs.

Here are the steps to include a package from Maven repository:

1) Connect to your EMR master node and run the following command to copy the “config.json” file out of the Jupyterhub Docker container:

sudo docker cp jupyterhub:/etc/jupyter/conf/config.json .

2) Edit the file using “vi” or a similar editor, and add the required packages (as a comma-delimited list) to the “session_configs” section. For example, I entered the PostgreSQL package:

"conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}

To help you understand the required modification, I’m sharing my original “config.json” file:

{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998",
"auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998",
"auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"},
"logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}},
"handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter",
"home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}},
"wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion":
"The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources 
for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is 
configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{  
"driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, 
"pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0,
"heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy":
"configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}

Here is the modified one:

{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998",
"auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998",
"auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"},
"logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}},
"handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter",
"home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}},
"wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion":
"The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources 
for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is 
configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{  
"conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}, 
"driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, 
"pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0,
"heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy":
"configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}

3) After you make the modification, copy the file back into the Jupyterhub Docker container:

sudo docker cp config.json jupyterhub:/etc/jupyter/conf/config.json

4) Restart the kernel of your Jupyter notebook, and you should be able to use the package.
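
For example, assuming Spark 2.x so that the spark session object is available in the notebook, you can verify that the PostgreSQL JDBC driver was loaded by reading a table over JDBC. The host, database, table and credentials below are placeholders for illustration:

# Read a table through the PostgreSQL JDBC driver pulled in via spark.jars.packages
# (all connection details are placeholders for illustration)
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://mydbhost:5432/mydb") \
    .option("dbtable", "public.mytable") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.show(5)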

Hope it helps.

