
After you have a Spark cluster running, how do you deploy Python programs to it? It’s not as straightforward as you might think or hope, so let’s explore further in this PySpark tutorial.
PySpark Application Deployment Overview
Let’s deploy a couple of example PySpark programs to our cluster. We’ll start with a simple example and then progress to more complicated ones, which include utilizing spark-packages and Spark SQL.
Ok, now that we’ve deployed a few examples as shown in the above screencast, let’s review a Python program which utilizes code we’ve already seen in the Spark with Python tutorials on this site. It’s a Python program which analyzes New York City Uber data using Spark SQL. The video shows the program in the Sublime Text editor, but you can use any editor you wish.
When deploying our driver program, we need to do things differently than we do when working in the pyspark shell. For example, we need to obtain a SparkContext and SQLContext ourselves, and we need to specify the Python imports.
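To make this concrete, here is a minimal sketch of what such a driver might look like. It is illustrative rather than the exact program from the video; the app name and the query are assumptions.

import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    # The pyspark shell creates these for us; a standalone driver must build them itself
    sc = SparkContext(appName="UberStats")
    sqlContext = SQLContext(sc)

    # Load the CSV file passed on the command line, using the spark-csv package
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load(sys.argv[1]))

    # Register the DataFrame as a temp table so we can run Spark SQL against it
    df.registerTempTable("uber")
    sqlContext.sql("SELECT COUNT(*) FROM uber").show()

    sc.stop()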
bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv
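Here, the --packages flag tells spark-submit to pull the spark-csv package and its dependencies down from Maven so the driver can read CSV files, while the CSV file name is simply passed through to the program as a command-line argument.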
Let’s return to the Spark UI now that we have an available worker in the cluster and have deployed some Python programs.
The Spark UI is the primary tool for Spark cluster diagnostics, so we’ll review its key attributes.
If you find these videos of deploying Python programs to an Apache Spark cluster interesting, you will find the entire Apache Spark with Python Course valuable. Make sure to check it out.
Additional Spark Python Resources
Featured Image credit https://flic.kr/p/bpd8Ht
Hi Todd,
I have followed along with your detailed tutorial trying to deploy a Python program to a Spark cluster. I tried deploying in standalone mode, and it ran successfully. However, when I tried to run it on EC2, I got "WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources".
to Standalone: bin/spark-submit --master spark://qiushiquandeMacBook-Pro.local:7077 examples/src/main/python/pi.py
to EC2: bin/spark-submit --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py
In standalone spark UI:
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 7.0 GB Total, 0.0 B Used
Applications: 0 Running, 5 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
In EC2 spark UI:
Alive Workers: 1
Cores in use: 2 Total, 0 Used
Memory in use: 6.3 GB Total, 0.0 B Used
Applications: 0 Running, 8 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
It seems to be a common issue in Spark for new users, but I still have no idea how to solve it.
Could you suggest any possible reasons for this issue? I would appreciate any suggestions. Thanks!
Hi Johny,
Maybe port 7077 is not open on your Spark cluster on EC2? Are you able to connect to the cluster via pyspark? And I’m assuming you’ve gone through all the steps here: https://supergloo.com/spark/apache-spark-cluster-amazon-ec2-tutorial/
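For example, a quick sanity check you could try from the machine running spark-submit, using your EC2 hostname from above:

nc -zv ec2-52-91-57-24.compute-1.amazonaws.com 7077
bin/pyspark --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077

If the port check fails, the EC2 security group likely isn’t allowing inbound traffic on the master port.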
Todd
Hi Todd,
Thanks for the suggestion. The EC2 tutorial has been helpful. Port 7077 is open and I am able to connect to the cluster via pyspark. I still get the warning message, though. I will try to figure it out.
Thanks a lot! I appreciate it.
Best,
Johnny
Hello Todd,
I tried using the following command to test a Spark program; however, I am getting an error. Does it have something to do with the “global visibility” factor? https://uploads.disquscdn.com/images/656810040871324cb2dc754723aa81b082361b3dd59cee5a38166e05170ff609.png