Apache Spark and ipython notebook – The Easy Way


Using ipython notebook with Apache Spark couldn’t be easier. This post covers how to use ipython notebook (Jupyter) with Spark and why it is the best choice when using Python with Spark.

Requirements

This post assumes you have downloaded and extracted Apache Spark and that you are running on a Mac or *nix. If you are on Windows, see http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

ipython notebook with Apache Spark

I recommend using the Python 2.7 Anaconda distribution, which can be downloaded from https://www.continuum.io/downloads. It contains more than 300 of the most popular Python packages for science, math, engineering, and data analysis. Future Python Spark tutorials and examples will also use this distribution.

After you have Anaconda installed, make sure that ipython notebook (Jupyter) is up to date. Run the following commands in the Terminal (Mac/Linux) or Command Prompt (Windows):

conda update conda
conda update ipython

Ref: the “I am getting started with Python” section of http://ipython.org/install.html
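To confirm which versions are actually picked up after the update, a quick check from Python can help. This is only a minimal sketch: the helper name installed_versions is made up for illustration, and it assumes nothing beyond the standard library plus an optional IPython import. It is written to run on both Python 2.7 and Python 3.

```python
import sys


def installed_versions():
    """Return the Python version and, if importable, the IPython version."""
    versions = {"python": "%d.%d.%d" % tuple(sys.version_info[:3])}
    try:
        import IPython  # only importable if ipython is installed in this environment
        versions["ipython"] = IPython.__version__
    except ImportError:
        versions["ipython"] = None  # ipython is not visible to this interpreter
    return versions


print(installed_versions())
```

If the "ipython" entry comes back as None, the interpreter you ran this with is probably not the Anaconda one.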

Launching ipython notebook with Apache Spark

1) In a terminal, go to the root of your Spark install and enter the following command:

IPYTHON_OPTS="notebook" ./bin/pyspark

A browser tab should open, and various output should appear in your terminal window, depending on your logging level.

What’s going on with the IPYTHON_OPTS setting passed to pyspark? You can look at the source of bin/pyspark in a text editor. This section is the relevant part:

# Determine the Python executable to use for the driver:
if [[ -n "$IPYTHON_OPTS" || "$IPYTHON" == "1" ]]; then
  # If IPython options are specified, assume user wants to run IPython
  # (for backwards-compatibility)
  PYSPARK_DRIVER_PYTHON_OPTS="$PYSPARK_DRIVER_PYTHON_OPTS $IPYTHON_OPTS"
  PYSPARK_DRIVER_PYTHON="ipython"
elif [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"$DEFAULT_PYTHON"}"
fi

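Hopefully this snippet makes sense: if IPYTHON_OPTS is set (or IPYTHON is 1), pyspark uses ipython as the driver Python and appends the IPYTHON_OPTS value to the driver options. Otherwise, it keeps PYSPARK_DRIVER_PYTHON if that is set, and falls back to PYSPARK_PYTHON and then the default Python if it is not.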
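If the shell conditionals are hard to read, the same decision can be sketched in Python. This is an illustration of the logic in the snippet above, not code that ships with Spark. The function name choose_driver_python and the default_python parameter are hypothetical stand-ins; in the real script, DEFAULT_PYTHON is defined elsewhere in bin/pyspark.

```python
def choose_driver_python(env, default_python="python"):
    """Mirror the bin/pyspark snippet: return (driver executable, driver options).

    env is a dict of environment variables. default_python stands in for the
    script's DEFAULT_PYTHON and is an assumption made for this sketch.
    """
    driver = env.get("PYSPARK_DRIVER_PYTHON", "")
    driver_opts = env.get("PYSPARK_DRIVER_PYTHON_OPTS", "")
    ipython_opts = env.get("IPYTHON_OPTS", "")

    if ipython_opts or env.get("IPYTHON") == "1":
        # IPython requested: append its options and force the ipython driver
        driver_opts = driver_opts + " " + ipython_opts
        driver = "ipython"
    elif not driver:
        # No explicit driver: use PYSPARK_PYTHON if set and non-empty, else the default
        driver = env.get("PYSPARK_PYTHON") or default_python
    return driver, driver_opts


print(choose_driver_python({"IPYTHON_OPTS": "notebook"}))
```

Note that, like the shell version, the options are joined with a plain space, so the result carries a leading space when PYSPARK_DRIVER_PYTHON_OPTS was empty.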

Verify Spark with ipython notebook

At this point, you should be able to create a new notebook and execute some python using the provided SparkContext.  For example:

print sc

or

print sc.version
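Printing sc only shows that the object exists. To check that Spark can actually run a job, a small computation is a reasonable next step. The sketch below assumes it runs in a notebook started through bin/pyspark, where sc is already defined; the helper name smoke_test is made up for illustration.

```python
def smoke_test(sc, n=100):
    """Sum the squares of 0..n-1 using the given SparkContext."""
    numbers = sc.parallelize(range(n))         # distribute a small local range as an RDD
    return numbers.map(lambda x: x * x).sum()  # map is a transformation, sum is an action


# In a notebook cell, where sc is provided by pyspark:
# smoke_test(sc)   # 328350 for the default n=100
```

Because sum is an action, calling it forces Spark to schedule and run a job, so you should see matching activity in the terminal log.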

Here’s a screencast of running ipython notebook with pyspark on my laptop


In this screencast, pay special attention to the terminal window log statements. At the default log level of INFO, you should see no errors in the pyspark output. Also, when you start a new notebook, the terminal should show that SparkContext sc is available for use, with lines such as

INFO SparkContext: Running Spark version

Why use ipython notebook with Spark?

1) The same reasons you use ipython notebook without Spark: convenience, notebooks that are easy to share and execute, and so on.

2) Code completion. As the screencast shows, a Python Spark developer can hit the tab key to see the available functions, also known as code completion options.
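For a rough, non-interactive view of what tab completion offers, you can list an object's public attributes yourself. This is only an approximation, since IPython's completer does more than this, and the helper name public_names is made up for illustration.

```python
def public_names(obj):
    """List attribute names on obj that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


# In a notebook started through bin/pyspark, where sc is provided:
# public_names(sc)

# The helper works on any object, for example a plain string:
print(public_names("spark")[:3])
```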

Hope this helps. Let me know if you have any questions. To continue with Python in Spark, check out the Spark Transformations in Python and Spark Actions in Python tutorials.

The next ipython tutorial covers ipython with a Spark Cluster.

2 thoughts on “Apache Spark and ipython notebook – The Easy Way”

  1. For those interested in running pyspark using Jupyter notebooks (the successor of IPython notebooks), you can use:

    PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
