How to Deploy Python Programs to a Spark Cluster


After you have a Spark cluster running, how do you deploy Python programs to a Spark Cluster?  It’s not as straightforward as you might think or hope, so let’s explore further in this PySpark tutorial.

PySpark Application Deploy Overview

Let’s deploy a couple of example PySpark programs to our cluster. We’ll start with a simple example and then progress to more complicated examples that utilize spark-packages and Spark SQL.
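To make that first step concrete, here is a minimal sketch of the kind of simple script you could submit to the cluster. It is not the exact program from the screencast; the file name simple_example.py and the sample data are placeholders.

# simple_example.py -- a minimal PySpark job suitable for spark-submit
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("SimpleDeployExample")
    sc = SparkContext(conf=conf)

    # Word count over a small in-memory dataset; swap in an RDD built
    # from a file available to the cluster if you have one.
    lines = sc.parallelize([
        "spark makes distributed computing simple",
        "deploying python programs to a spark cluster",
    ])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print("%s: %d" % (word, count))

    sc.stop()

You would submit it with something like bin/spark-submit --master spark://<your-master-host>:7077 simple_example.py, substituting your own master URL.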

Ok, now that we’ve deployed a few examples as shown in the above screencast, let’s review a Python program which utilizes code we’ve already seen in the Spark with Python tutorials on this site. It’s a Python program that analyzes New York City Uber data using Spark SQL. The video shows the program in the Sublime Text editor, but you can use any editor you wish.

When deploying our driver program, we need to do a few things differently than when working in the pyspark shell. For example, we need to obtain a SparkContext and SQLContext ourselves, and we need to specify the Python imports.
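As a rough guide (not the exact uberstats.py shown in the video), a deployed driver for the Uber analysis looks something like the following. The column names dispatching_base_number and trips are assumptions based on the Uber FOIL dataset, and the CSV path comes from the command-line argument shown in the spark-submit command below.

# uberstats.py (sketch) -- illustrates the boilerplate a deployed
# driver needs: imports, SparkContext, SQLContext.
import sys

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    # Unlike the pyspark shell, a deployed driver must create its own contexts.
    conf = SparkConf().setAppName("UberStats")
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Load the CSV passed on the command line using the spark-csv package
    # supplied at submit time via --packages.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load(sys.argv[1]))

    # Example Spark SQL query; column names assume the Uber FOIL layout.
    df.registerTempTable("uber")
    sqlContext.sql(
        "SELECT dispatching_base_number, SUM(trips) AS total_trips "
        "FROM uber GROUP BY dispatching_base_number "
        "ORDER BY total_trips DESC").show()

    sc.stop()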

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv

Let’s return to the Spark UI now that we have an available worker in the cluster and have deployed some Python programs.

The Spark UI is the primary tool for Spark cluster diagnostics, so we’ll review its key attributes.

If you find these videos of deploying Python programs to an Apache Spark cluster interesting, you will find the entire Apache Spark with Python Course valuable.  Make sure to check it out.

Additional Spark Python Resources

PySpark Tutorials

PySpark Transformations in Python Examples

Spark Tutorial

Featured Image credit https://flic.kr/p/bpd8Ht

5 thoughts on “How to Deploy Python Programs to a Spark Cluster”

  1. Hi Todd,

    I have followed along with your detailed tutorial trying to deploy a Python program to a Spark cluster. I tried deploying in Standalone Mode, and it went through successfully. However, when I tried to run it on EC2, I got “WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources”.

    to Standalone: bin/spark-submit --master spark://qiushiquandeMacBook-Pro.local:7077 examples/src/main/python/pi.py
    to EC2: bin/spark-submit --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

    In standalone spark UI:
    Alive Workers: 1
    Cores in use: 4 Total, 0 Used
    Memory in use: 7.0 GB Total, 0.0 B Used
    Applications: 0 Running, 5 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    In EC2 spark UI:
    Alive Workers: 1
    Cores in use: 2 Total, 0 Used
    Memory in use: 6.3 GB Total, 0.0 B Used
    Applications: 0 Running, 8 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    It seems to be a common issue in Spark for new users, but I still have no idea how to solve it.
    Could you suggest any possible reasons for this issue? I would appreciate any suggestions. Thanks!

      1. Hi Todd,

        Thanks for the suggestion. The EC2 tutorial has been helpful. Port 7070 is open and I am able to connect to the cluster via PySpark. I still get the warning message, though. I will try to figure it out.

        Thanks a lot! I appreciate it.

        Best,
        Johnny

  2. Hello Todd,
    I tried using the following command to test a Spark program, but I am getting an error. Does it have something to do with the “global visibility” factor?
