Spark applications may depend on third-party Java or Scala packages stored in a Maven repository, and these packages can be included with the "--packages" parameter when submitting a Spark job. For example, if we want to include the PostgreSQL JDBC driver in our Spark job when submitting from the command line, we can use the following syntax:
spark-submit --packages org.postgresql:postgresql:9.4.1211
We can also use the same parameter when launching an interactive PySpark session:
pyspark --packages org.postgresql:postgresql:9.4.1211
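Once the package is on the classpath, the JDBC driver it provides can be used from Spark. Here is a minimal PySpark sketch of that; the hostname, database, table, and credentials are placeholders, not values from this post:

from pyspark.sql import SparkSession

# In the pyspark shell a SparkSession already exists; getOrCreate() reuses it.
spark = SparkSession.builder.appName("postgres-example").getOrCreate()

# The --packages JAR supplies the org.postgresql.Driver class used below.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://your-db-host:5432/your_database")  # placeholder host/db
      .option("dbtable", "public.your_table")                              # placeholder table
      .option("user", "your_user")                                         # placeholder credentials
      .option("password", "your_password")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(5)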
JupyterHub, however, does not let you use this parameter. On an EMR cluster, JupyterHub is installed as a Docker image and uses Livy to submit Spark jobs. When you connect to a JupyterHub notebook and run any Spark script, the Spark context is created automatically, so at that point it is not possible to modify the Spark configuration to include extra packages. On the other hand, JupyterHub reads /etc/jupyter/conf/config.json when creating a Spark context, so if we modify this file, we can include any Maven package in our Spark job.
Here are the steps to include a package from a Maven repository:
1- Connect to your EMR master node and run the following command to copy the "config.json" file out of the JupyterHub Docker container:
sudo docker cp jupyterhub:/etc/jupyter/conf/config.json .
2- Edit the file using "vi" or a similar editor, and add the required packages (as a comma-delimited list) to the "session_configs" section. For example, I entered the postgresql package:
"conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}
To help you understand the required modification, I’m sharing my original “config.json” file:
{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"}, "logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}}, "handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter", "home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}}, "wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{ "driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, "pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0, "heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy": "configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}
Here is the modified one:
{"kernel_python_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_scala_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998", "auth":"None"},"kernel_r_credentials":{"username":"","password":"","url":"http://ip-##-##-##-##.eu-west-1.compute.internal:8998"}, "logging_config":{"version":1,"formatters":{"magicsFormatter":{"format":"%(asctime)s\t%(levelname)s\t%(message)s","datefmt":""}}, "handlers":{"magicsHandler":{"class":"hdijupyterutils.filehandler.MagicsFileHandler","formatter":"magicsFormatter", "home_path":"~/.sparkmagic"}},"loggers":{"magicsLogger":{"handlers":["magicsHandler"],"level":"DEBUG","propagate":0}}}, "wait_for_idle_timeout_seconds":15,"livy_session_startup_timeout_seconds":60,"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.","ignore_ssl_errors":false,"session_configs":{ "conf": {"spark.jars.packages": "org.postgresql:postgresql:9.4.1211"}, "driverMemory":"1000M","executorCores":2},"use_auto_viz":true, "coerce_dataframe":true,"max_results_sql":2500, "pyspark_dataframe_encoding":"utf-8","heartbeat_refresh_seconds":30,"livy_server_heartbeat_timeout_seconds":0, "heartbeat_retry_seconds":10,"server_extension_default_kernel_name":"pysparkkernel","custom_headers":{},"retry_policy": "configurable","retry_seconds_to_sleep_list":[0.2,0.5,1.0,3.0,5.0],"configurable_retry_policy_max_retries":8}
3- After you make the modification, copy the file back into the JupyterHub Docker container:
sudo docker cp config.json jupyterhub:/etc/jupyter/conf/config.json
4- Restart the kernel in your Jupyter notebook, and you should then be able to use the package.
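One quick way to confirm the package was picked up (assuming the new session uses the modified "session_configs") is to print the relevant Spark property from a notebook cell:

# Should print "org.postgresql:postgresql:9.4.1211" if the config was applied
print(spark.sparkContext.getConf().get("spark.jars.packages"))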
Hope it helps.