How to Deploy Python Programs to a Spark Cluster


After you have a Spark cluster running, how do you deploy Python programs to it?  If you find these videos on deploying Python programs to an Apache Spark cluster interesting, you will find the entire Apache Spark with Python course valuable.  Make sure to check it out.

In this post, we’ll deploy a couple of example Python programs. We’ll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.

Deploy Python programs to Spark Cluster Part 1

Ok, now that we’ve deployed a few examples, let’s review a Python program which utilizes code we’ve already seen in the Spark with Python tutorials on this site. It’s a Python program which analyzes New York City Uber data using Spark SQL. The video will show the program in the Sublime Text editor, but you can use any editor you wish.

When deploying our driver program, we need to do things differently than when working with the pyspark shell. For example, we need to obtain a SparkContext and SQLContext ourselves, and we need specific Python imports.

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv

Deploy Python program to Spark Cluster Part 2

Let’s return to the Spark UI now that we have an available worker in the cluster and have deployed some Python programs.

The Spark UI is the tool for Spark Cluster diagnostics, so we’ll review the key attributes of the tool.

Deploy Python program to Spark Cluster Part 3 – Spark UI

Featured Image credit https://flic.kr/p/bpd8Ht

4 thoughts on “How to Deploy Python Programs to a Spark Cluster”

  1. Hi Todd,

    I have followed along your detailed tutorial, trying to deploy a Python program to a Spark cluster. I tried deploying to Standalone Mode, and it went through successfully. However, when I tried to run it on EC2, I got “WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources”.

    to Standalone: bin/spark-submit --master spark://qiushiquandeMacBook-Pro.local:7077 examples/src/main/python/pi.py
    to EC2: bin/spark-submit --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

    In standalone spark UI:
    Alive Workers: 1
    Cores in use: 4 Total, 0 Used
    Memory in use: 7.0 GB Total, 0.0 B Used
    Applications: 0 Running, 5 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    In EC2 spark UI:
    Alive Workers: 1
    Cores in use: 2 Total, 0 Used
    Memory in use: 6.3 GB Total, 0.0 B Used
    Applications: 0 Running, 8 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    It seems to be a common issue in Spark for new users, but I still have no idea how to solve it.
    Could you suggest any possible reasons for this issue? I would appreciate any suggestions. Thanks!

      1. Hi Todd,

        Thanks for the suggestion. The EC2 tutorial has been helpful. Port 7070 is open and I am able to connect to the cluster via pyspark. I still get the warning message, though. I will try to figure it out.

        Thanks a lot! I appreciate it.

        Best,
        Johnny
