How to Deploy Python Programs to a Spark Cluster


After you have a Spark cluster running, how do you deploy Python programs to it?  If you find these videos on deploying Python programs to an Apache Spark cluster interesting, you will find the entire Apache Spark with Python course valuable.  Make sure to check it out.

In this post, we’ll deploy a couple of example Python programs. We’ll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.

Deploy Python programs to Spark Cluster Part 1

Ok, now that we’ve deployed a few examples, let’s review a Python program which utilizes code we’ve already seen in the Spark with Python tutorials on this site. It’s a Python program which analyzes New York City Uber data using Spark SQL. The video will show the program in the Sublime Text editor, but you can use any editor you wish.

When deploying our driver program, we need to do things differently than when working with the pyspark shell. For example, we need to obtain a SparkContext and SQLContext ourselves, and we need specific Python imports.

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv

Deploy Python program to Spark Cluster Part 2

Let’s return to the Spark UI now that we have an available worker in the cluster and have deployed some Python programs.

The Spark UI is the tool for Spark Cluster diagnostics, so we’ll review the key attributes of the tool.

Deploy Python program to Spark Cluster Part 3 – Spark UI

Featured Image credit https://flic.kr/p/bpd8Ht

4 thoughts on “How to Deploy Python Programs to a Spark Cluster”

  1. Hi Todd,

    I have followed along your detailed tutorial, trying to deploy a Python program to a Spark cluster. I tried deploying to Standalone Mode, and it went through successfully. However, when I tried to run it on EC2, I got “WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources”.

    to Standalone: bin/spark-submit --master spark://qiushiquandeMacBook-Pro.local:7077 examples/src/main/python/pi.py
    to EC2: bin/spark-submit --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

    In standalone spark UI:
    Alive Workers: 1
    Cores in use: 4 Total, 0 Used
    Memory in use: 7.0 GB Total, 0.0 B Used
    Applications: 0 Running, 5 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    In EC2 spark UI:
    Alive Workers: 1
    Cores in use: 2 Total, 0 Used
    Memory in use: 6.3 GB Total, 0.0 B Used
    Applications: 0 Running, 8 Completed
    Drivers: 0 Running, 0 Completed
    Status: ALIVE

    It seems to be a common issue in Spark for new users, but I still have no idea how to solve it.
    Could you suggest any possible reasons for this issue? I would appreciate any suggestions. Thanks!

      1. Hi Todd,

        Thanks for the suggestion. The EC2 tutorial has been helpful. Port 7070 is open and I am able to connect to the cluster via pyspark. I still get the warning message, though. I will try to figure it out.

        Thanks a lot! I appreciate it.

        Best,
        Johnny
