This post will cover how to connect an ipython notebook to two kinds of Spark clusters: a Spark cluster running in Standalone mode and a Spark cluster running on Amazon EC2.
You need to have either a Spark Standalone cluster or an Apache Spark cluster on EC2 running to complete this tutorial. See the Background section of this post for further information and helpful references.
Connecting ipython notebook to an Apache Spark Standalone Cluster
Connecting to the Spark cluster from ipython notebook is easy. Simply set the IPYTHON_OPTS environment variable and pass the cluster's master URL to the --master argument when calling pyspark, for example:
IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://todd-mcgraths-macbook-pro.local:7077
Then, in the notebook, run sc.version or some other function off of sc. There is really no way I know of to programmatically determine whether ipython notebook is truly running against the Spark cluster. But, we can verify from the Spark Web UI:
Connecting an ipython notebook to an Apache Spark Cluster running on EC2
Using pyspark against a remote cluster is just as easy. Just pass in the appropriate URL to the --master argument.
IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077
As you saw in this tutorial, connecting to a Standalone cluster or a Spark cluster running on EC2 is essentially the same. It's easy. The difficult part of connecting to a Spark cluster happens beforehand. Check the next section on Background Information for help setting up your Apache Spark cluster and/or connecting ipython notebook to a Spark cluster.
Background Information or Possibly Helpful References
1) How to use ipython notebook with Spark: Apache Spark and ipython notebook – The Easy Way
2) Apache Spark Cluster in Standalone mode tutorial, in which you learn how to run a Spark Standalone cluster and how to connect the Scala console to utilize that cluster.
Featured Image: https://flic.kr/p/5dBco