
Using ipython notebook with Apache Spark couldn't be easier. This post will cover how to use ipython notebook (jupyter) with Spark and why it is the best choice when using Python with Spark.
Requirements
This post assumes you have downloaded and extracted Apache Spark and you are running on a Mac or *nix. If you are on Windows, see http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
ipython notebook with Apache Spark
At the time of this writing in January 2016, I recommend using the Python 2.7 Anaconda distribution, which can be downloaded from https://www.continuum.io/downloads. It contains more than 300 of the most popular Python packages for science, math, engineering, and data analysis.
After you have Anaconda installed, you should make sure that ipython notebook (Jupyter) is up to date. Run the following commands in the Terminal (Mac/Linux) or Command Prompt (Windows):
conda update conda
conda update ipython
Ref: http://ipython.org/install.html in the “I am getting started with Python” section
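If you want to confirm which IPython you ended up with, here is a minimal sketch of a check you can run from a Python prompt (it assumes the Anaconda Python is first on your PATH):
# Print the installed IPython version
import IPython
print IPython.__version__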
Launching ipython notebook with Apache Spark
1) In a terminal, go to the root of your Spark install and enter the following command:
IPYTHON_OPTS="notebook" ./bin/pyspark
A browser tab should launch, and you will see various output in your terminal window depending on your logging level.
What’s going on here with the IPYTHON_OPTS variable passed to pyspark? Well, you can look at the source of bin/pyspark in a text editor. This section:
# Determine the Python executable to use for the driver:
if [[ -n "$IPYTHON_OPTS" || "$IPYTHON" == "1" ]]; then
  # If IPython options are specified, assume user wants to run IPython
  # (for backwards-compatibility)
  PYSPARK_DRIVER_PYTHON_OPTS="$PYSPARK_DRIVER_PYTHON_OPTS $IPYTHON_OPTS"
  PYSPARK_DRIVER_PYTHON="ipython"
elif [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"$DEFAULT_PYTHON"}"
fi
Hopefully, this snippet makes sense: if IPYTHON_OPTS is set, pyspark switches the driver Python to ipython and passes the options along, which is how "notebook" reaches ipython.
Verify Spark with ipython notebook
At this point, you should be able to create a new notebook and execute some python using the provided SparkContext. For example:
print sc
or
print sc.version
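For a slightly fuller check, here is a minimal sketch that uses the provided sc to build a small RDD and run a couple of actions (the data is just illustrative):
# Build a small RDD from a local Python list and run a couple of actions
rdd = sc.parallelize([1, 2, 3, 4, 5])
print rdd.count()                          # 5
print rdd.map(lambda x: x * x).collect()   # [1, 4, 9, 16, 25]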
Here's a screencast of running ipython notebook with pyspark on my laptop.
In this screencast, pay special attention to the log statements in your terminal window. At the default log level of INFO, you should see no errors in the pyspark output. Also, when you start a new notebook, the terminal should show the SparkContext sc being made available for use, such as
INFO SparkContext: Running Spark version
Why use ipython notebook with Spark?
1) The same reasons you would use ipython notebook without Spark: convenience, notebooks that are easy to share and execute, etc.
2) Code completion. As the screencast shows, a PySpark developer can hit the Tab key to see the available functions, also known as code completion (see the short sketch below).
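For example, here is a minimal sketch of what that looks like in a notebook cell (the exact completion list depends on your Spark version):
# Type "sc." or "rdd." and press Tab to see the available methods (map, filter, collect, ...)
rdd = sc.parallelize(range(10))
print rdd.filter(lambda x: x % 2 == 0).collect()   # e.g. filter() chosen from the completion list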
Hope this helps; let me know if you have any questions. To continue with Python in Spark, check out the Spark Transformations in Python and Spark Actions in Python tutorials.
The next ipython tutorial is ipython with a Spark Cluster.
For those wanting to run pyspark with Jupyter notebooks (the successor to IPython notebooks), you can use:
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Thanks so much!