When your PySpark application is ready to deploy to production or to a pre-production testing environment, how do you do it? How do you deploy Python programs to a Spark cluster?
The short answer is: it depends on the complexity of your PySpark application. Does all of the logic of your Python application reside in one .py file? Or does it have external dependencies that are needed in order to run?
In this post, let’s cover a few examples of both when deploying your PySpark application to a Spark cluster.
Table of Contents
- PySpark Spark Submit Overview
- PySpark Application Deploy Overview
- PySpark with dependencies Deploy
- PySpark spark-submit Frequently Asked Questions
- Additional Resources
PySpark Spark Submit Overview
Deploying a Spark program, whether written in Python or Scala/Java, uses the spark-submit script. A deep dive on spark-submit has already been covered on this site, so please reference it if you need a refresher or are interested in more detail.
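As a quick refresher, deploying a single-file PySpark application can be as simple as the following sketch. The master URL, file name, and arguments here are placeholders, not values from a real cluster:

```shell
# Minimal spark-submit of a single-file PySpark application.
# Replace the master URL and file names with your own values.
bin/spark-submit \
  --master spark://your-master-host:7077 \
  your_app.py arg1 arg2
```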
PySpark Application Deploy Overview
Let’s deploy a couple of example PySpark programs to our cluster. We’ll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.
Ok, now that we’ve deployed a few examples as shown in the above screencast, let’s review a Python program which utilizes code we’ve already seen in the Spark with Python tutorials on this site. It’s a Python program which analyzes New York City Uber data using Spark SQL. The video shows the program in the Sublime Text editor, but you can use any editor you wish.
When deploying our driver program, we need to do things differently than we do when working in the interactive pyspark shell. For example, we need to obtain a SparkContext and SQLContext ourselves, and we need to specify our Python imports explicitly.
Quick note: see SparkContext vs. SparkSession vs. SQLContext for more background here. These examples use SparkContext, which indicates an older version of PySpark.
bin/spark-submit --master spark://macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv
Let’s return to the Spark UI now that we have an available worker in the cluster and have deployed some Python programs.
The Spark UI is the tool for Spark cluster diagnostics, so we’ll review the key attributes of the tool.
As mentioned above, these screencasts are a few years old, but the mechanics are still the same. At this point in 2023, the --py-files option is worth noting, though. We’ll cover it in the next section.
PySpark with dependencies Deploy
In the preceding example, we deployed a simple PySpark program. But what about PySpark applications with multiple Python files? What if you packaged your PySpark program into a .zip or .egg file?
In this case, you’ll want to use the --py-files option of spark-submit:
bin/spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
--driver-memory <value> \
--executor-memory <value> \
--executor-cores <number of cores> \
--py-files file1.py,file2.py,uber.egg \
<main-python-file> [application arguments]
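As a sketch of how you might build a .zip of dependency modules before submitting with --py-files, Python’s standard library is enough. The file names here are hypothetical, and this helper function is mine, not part of Spark:

```python
# Build a zip of helper modules so they can be shipped to executors
# via `spark-submit --py-files deps.zip`. File names are hypothetical.
import os
import zipfile

def build_dependency_zip(zip_path, module_paths):
    """Zip the given .py module files at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in module_paths:
            # arcname keeps only the file name, so `import file1` works
            # on the executors once the zip is added to sys.path
            zf.write(path, arcname=os.path.basename(path))
    return zip_path
```

You would then pass the resulting archive on the command line, e.g. --py-files deps.zip.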
For more info on Python Package Management in Spark see, Python Package Management from Spark docs.
PySpark spark-submit Frequently Asked Questions
pyspark vs spark-submit
It should be obvious by now, but PySpark and spark-submit are two very different things.
PySpark is the Python API for Apache Spark. PySpark provides a Python wrapper around Spark’s core functionalities, including Spark SQL, Spark Streaming, and MLlib.
On the other hand, spark-submit is a command-line tool that allows you to submit Spark applications to a cluster. It takes care of launching the driver program on the cluster, setting up the necessary environment, and running the application. As we now know, spark-submit can be used to submit applications written in various programming languages, including Python, Java, and Scala.
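In practice, the difference looks like this. The host and file names are placeholders:

```shell
# pyspark starts an interactive Python shell connected to Spark
bin/pyspark --master spark://your-master-host:7077

# spark-submit runs a complete application non-interactively
bin/spark-submit --master spark://your-master-host:7077 your_app.py
```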
PySpark spark-submit arguments
There are two ways to answer the question of how to add arguments to spark-submit when deploying a PySpark application.
- If you want to set Spark-related arguments, just follow the regular spark-submit mechanics such as “--conf”, “--driver-memory”, etc. as shown above and explored in more detail in the spark-submit tutorial.
- If you want to pass arguments to the PySpark program itself, just add them to the end of the spark-submit command, as shown above but expanded here for illustrative purposes:
bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8g \
--executor-memory 16g \
--executor-cores 2 \
--py-files file1.py,file2.py,uber.egg \
uberstats.py Uber-Jan-Feb-FOIL.csv
In this example, we are passing in an argument of the CSV file to use.
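Inside the driver, picking up that trailing argument is plain Python via sys.argv. A minimal sketch follows; the helper function name is mine and not taken from the original uberstats.py:

```python
# Read the CSV path passed after the script name on the spark-submit
# command line, e.g. `spark-submit uberstats.py Uber-Jan-Feb-FOIL.csv`.
import sys

def get_input_path(argv):
    """Return the first application argument (the CSV path),
    or exit with a usage hint when it is missing."""
    if len(argv) < 2:
        raise SystemExit("Usage: spark-submit uberstats.py <csv-file>")
    return argv[1]

# In the real driver you would call:
#   csv_path = get_input_path(sys.argv)
# and then pass csv_path to your Spark SQL reading logic.
```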
Be sure to check out the other Spark with Python tutorials on this site.