Use PySpark joins with SQL to compare, and possibly combine, data from two or more datasources based on matching field values. This is simply called “joins” in many cases and usually the datasources are tables from a database or flat file sources, but more often than not, the data sources are becoming Kafka topics. Regardless of data source, it is critical to understand there are many different types of joins. The best join type to use will depend on both the desired outcome, as well as the join type most optimal for performance, because sometimes joins can be an expensive operation.
The concept and implementation of joins functionality is something which predates Spark and PySpark. It has origins in relational algebra, but it is often associated with SQL and running joins against databases. The American National Standards Institute or ANSI-standard SQL specifies five types of joins: INNER
, LEFT OUTER
, RIGHT OUTER
, FULL OUTER
and CROSS
. These types of joins can be achieved in PySpark SQL in two primary ways. In this tutorial we will explore using SQL from PySpark. Let’s explore all the different types of PySpark with SQL joins with examples. The other approach is to use the DataFrame join
function within PySpark when constructing the JOIN type. Using the PySpark DataFrame join
function is covered in a separate tutorial with a link below in the Resources section.
Table of Contents
- PySpark Joins with SQL Overview
- PySpark Join with SQL Examples Initial Setup
- PySpark Join Type Examples
PySpark Joins with SQL Overview
Before we begin all the examples, let’s confirm your understanding of a few key points about SQL joins in PySpark.
First, the type of join is specified within the SQL Statement. The available options of join type string values include [ INNER ] | CROSS | LEFT [ OUTER ] | [ LEFT ] SEMI | RIGHT [ OUTER ] | FULL [ OUTER ] | [ LEFT ] ANTI.
The default join type is INNER
.
Second, there is some overlap in name types for convenience; for example, either LEFT or LEFT OUTER
may be used and will produce the exact same outcome. Again, this is simply for convenience.
PySpark SQL Joins Deep Dive
The signature of SQL Joins is:
relation { [ join_type ] JOIN relation [ join_criteria ] | NATURAL join_type JOIN relation }
The parameter arguments include:
- parameter
relation
: Specifies relation to be joined - parameter
join_type
: Specifies the join type; i.e. INNER, LEFT, RIGHT, etc. - parameter
join_criteria
: Specifies how the rows from one relation will be combined with the rows of another relation using boolean criteria with syntax of ON boolean_expression | USING ( column_name [ , … ] )
I realize this might not be helpful the first time you read this section, but come back to it after going through some examples below.
PySpark Join with SQL Examples Initial Setup
We will use the PySpark shell and to explore joins. This will keep things simple, concise, and focus on the different outcomes of join types. An alternative to this approach would be using the spark-sql
shell directly, but in this tutorial we are focusing on using SQL with PySpark.
The following DataFrames will be required to complete all the tutorial examples.
products = [
(1,"Syrup - Golden, Lyles","41-889-0877",4,30.95), \
(2,"Huck White Towels","10-857-2683",21,11.13), \
(3,"Pasta - Lasagna Noodle, Frozen","08-151-1046",2,22.72), \
(4,"Fiddlehead - Frozen","15-125-2352",1,22.66), \
(5,"Juice - Clamato, 341 Ml","40-753-5219",2,37.72), \
(6,"Lamb - Racks, Frenched","52-656-0114",2,32.78), \
(7,"Beer - Alexander Kieths, Pale Ale","79-864-2525",18,20.73), \
(8,"Oil - Avocado","41-264-0597",4,11.71), \
(9,"Juice - V8, Tomato","47-401-5889",8,13.05), \
(10,"Lotus Rootlets - Canned","12-923-5239",5,39.76), \
(11,"Oats Large Flake","70-628-9900",2,12.57), \
(12,"Cheese - Brie,danish","33-116-0464",1,5.00), \
(13,"Bread - Pullman, Sliced","67-046-6746",1,7.52), \
(14,"Lettuce - Green Leaf","77-181-3088",3,19.16), \
(15,"Creamers - 10%","81-764-7420",7,36.32) \
]
productColumns = ["id","name","sku","category_id", "current_price" ]
productCategories = [
(1,"Good Times?"), \
(2,"Let's Eat!"), \
(3,"Big Time TV Show Snack"), \
(4,"Not Vegan"), \
(5,"More carnivore") \
]
productCategoryColumns = ["id","name" ]
cat = spark.createDataFrame(data=productCategories,schema=productCategoryColumns)
prod = spark.createDataFrame(data=products, schema=productColumns)
prod.createTempView("products")
cat.createTempView("categories")
spark.sql("select * from products").show()
This is an example of what the above code looks like in the pyspark shell. If you are going to run the examples below, then all these steps need to be verified.
$ pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Python version 3.9.12 (main, Mar 26 2022 15:51:15)
Spark context Web UI available at http://192.168.1.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1666813863341).
SparkSession available as 'spark'.
>>> products = [
... (1,"Syrup - Golden, Lyles","41-889-0877",4,30.95), \
... (2,"Huck White Towels","10-857-2683",21,11.13), \
... (3,"Pasta - Lasagna Noodle, Frozen","08-151-1046",2,22.72), \
... (4,"Fiddlehead - Frozen","15-125-2352",1,22.66), \
... (5,"Juice - Clamato, 341 Ml","40-753-5219",2,37.72), \
... (6,"Lamb - Racks, Frenched","52-656-0114",2,32.78), \
... (7,"Beer - Alexander Kieths, Pale Ale","79-864-2525",18,20.73), \
... (8,"Oil - Avocado","41-264-0597",4,11.71), \
... (9,"Juice - V8, Tomato","47-401-5889",8,13.05), \
... (10,"Lotus Rootlets - Canned","12-923-5239",5,39.76), \
... (11,"Oats Large Flake","70-628-9900",2,12.57), \
... (12,"Cheese - Brie,danish","33-116-0464",1,5.00), \
... (13,"Bread - Pullman, Sliced","67-046-6746",1,7.52), \
... (14,"Lettuce - Green Leaf","77-181-3088",3,19.16), \
... (15,"Creamers - 10%","81-764-7420",7,36.32) \
... ]
>>> productColumns = ["id","name","sku","category_id", "current_price" ]
>>>
>>> productCategories = [
... (1,"Good Times?"), \
... (2,"Let's Eat!"), \
... (3,"Big Time TV Show Snack"), \
... (4,"Not Vegan"), \
... (5,"More carnivore") \
... ]
>>> productCategoryColumns = ["id","name" ]
>>> cat = spark.createDataFrame(data=productCategories,schema=productCategoryColumns)
>>> prod = spark.createDataFrame(data=products, schema=productColumns)
>>> prod.createTempView("products")
>>> cat.createTempView("cat")
>>> spark.sql("select * from products").show()
+---+--------------------+-----------+-----------+-------------+
| id| name| sku|category_id|current_price|
+---+--------------------+-----------+-----------+-------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95|
| 2| Huck White Towels|10-857-2683| 21| 11.13|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78|
| 7|Beer - Alexander ...|79-864-2525| 18| 20.73|
| 8| Oil - Avocado|41-264-0597| 4| 11.71|
| 9| Juice - V8, Tomato|47-401-5889| 8| 13.05|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76|
| 11| Oats Large Flake|70-628-9900| 2| 12.57|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16|
| 15| Creamers - 10%|81-764-7420| 7| 36.32|
+---+--------------------+-----------+-----------+-------------+
The sample data of food products looks legit, but the food categories are values are not really food categories. You may have noticed it. I made up the food category values to have some fun and because I have been the starring actor of big time TV shows before. Yep, that’s me, but anyhow, let’s back to the examples.
PySpark Join Type Examples
Inner Join
An inner join is the default join and selects rows having matching values in both relations
>>> spark.sql("select * from products inner join cat on products.category_id = cat.id").show()
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
or we can order specific rows in the result set as shown in the following
>>> spark.sql("select * from products inner join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
Notice missing products with IDs of 2, 7, 9 and 15 because there are no matching categories for the category_id.
Full Outer Join
A full outer join (or any of the following outer
,full
,fullouter
,full_outer
, see table above) returns all values from both relations, appending NULL values on the side that does not have a match.
>>> spark.sql("select * from products full outer join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+----+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+----+--------------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 2| Huck White Towels|10-857-2683| 21| 11.13|null| null|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 7|Beer - Alexander ...|79-864-2525| 18| 20.73|null| null|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 9| Juice - V8, Tomato|47-401-5889| 8| 13.05|null| null|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
| 15| Creamers - 10%|81-764-7420| 7| 36.32|null| null|
+---+--------------------+-----------+-----------+-------------+----+--------------------+
Now “null” values appear in the rows for categories that don’t exist. Compare that to previous inner join example if it isn’t clear.
Left Outer Join | Left Join
A left join (or any of the following left
, leftouter
, left_outer
, again, see table above) returns all values from left relation and the right relation matched values, or appends NULL if there is no match.
>>> spark.sql("select * from products full outer join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+----+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+----+--------------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 2| Huck White Towels|10-857-2683| 21| 11.13|null| null|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 7|Beer - Alexander ...|79-864-2525| 18| 20.73|null| null|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 9| Juice - V8, Tomato|47-401-5889| 8| 13.05|null| null|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
| 15| Creamers - 10%|81-764-7420| 7| 36.32|null| null|
+---+--------------------+-----------+-----------+-------------+----+--------------------+
Right Outer Join | Right Join
A right outer join (or any of the following right
, rightouter
, right_outer
) returns all values from the right relation and the matched values from the left relation, or appends NULL if there is no match.
>>> spark.sql("select * from products right join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
but compare it with the following where we switch the right and left tables (DataFrames) in the join.
>>> spark.sql("select * from cat right join products on cat.id = products.category_id order by products.id").show()
+----+--------------------+---+--------------------+-----------+-----------+-------------+
| id| name| id| name| sku|category_id|current_price|
+----+--------------------+---+--------------------+-----------+-----------+-------------+
| 4| Not Vegan| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95|
|null| null| 2| Huck White Towels|10-857-2683| 21| 11.13|
| 2| Let's Eat!| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72|
| 1| Good Times?| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66|
| 2| Let's Eat!| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72|
| 2| Let's Eat!| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78|
|null| null| 7|Beer - Alexander ...|79-864-2525| 18| 20.73|
| 4| Not Vegan| 8| Oil - Avocado|41-264-0597| 4| 11.71|
|null| null| 9| Juice - V8, Tomato|47-401-5889| 8| 13.05|
| 5| More carnivore| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76|
| 2| Let's Eat!| 11| Oats Large Flake|70-628-9900| 2| 12.57|
| 1| Good Times?| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0|
| 1| Good Times?| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52|
| 3|Big Time TV Show ...| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16|
|null| null| 15| Creamers - 10%|81-764-7420| 7| 36.32|
+----+--------------------+---+--------------------+-----------+-----------+-------------+
Cross Join
When looking for the Cartesian product of two relations being compared use a cross
join.
>>> spark.sql("select * from products cross join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| id| name| sku|category_id|current_price| id| name|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95| 4| Not Vegan|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72| 2| Let's Eat!|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66| 1| Good Times?|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72| 2| Let's Eat!|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78| 2| Let's Eat!|
| 8| Oil - Avocado|41-264-0597| 4| 11.71| 4| Not Vegan|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76| 5| More carnivore|
| 11| Oats Large Flake|70-628-9900| 2| 12.57| 2| Let's Eat!|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0| 1| Good Times?|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52| 1| Good Times?|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16| 3|Big Time TV Show ...|
+---+--------------------+-----------+-----------+-------------+---+--------------------+
>>> spark.sql("select * from cat cross join products on cat.id = products.category_id order by products.id").show()
+---+--------------------+---+--------------------+-----------+-----------+-------------+
| id| name| id| name| sku|category_id|current_price|
+---+--------------------+---+--------------------+-----------+-----------+-------------+
| 4| Not Vegan| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95|
| 2| Let's Eat!| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72|
| 1| Good Times?| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66|
| 2| Let's Eat!| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72|
| 2| Let's Eat!| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78|
| 4| Not Vegan| 8| Oil - Avocado|41-264-0597| 4| 11.71|
| 5| More carnivore| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76|
| 2| Let's Eat!| 11| Oats Large Flake|70-628-9900| 2| 12.57|
| 1| Good Times?| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0|
| 1| Good Times?| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52|
| 3|Big Time TV Show ...| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16|
+---+--------------------+---+--------------------+-----------+-----------+-------------+
Anti Join
An anti join returns values having no match with the right join. This is convenient way to find rows without matches when you are expecting matches for all rows. It is also commonly referred to as a “left anti join”.
>>> spark.sql("select * from products anti join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+
| id| name| sku|category_id|current_price|
+---+--------------------+-----------+-----------+-------------+
| 2| Huck White Towels|10-857-2683| 21| 11.13|
| 7|Beer - Alexander ...|79-864-2525| 18| 20.73|
| 9| Juice - V8, Tomato|47-401-5889| 8| 13.05|
| 15| Creamers - 10%|81-764-7420| 7| 36.32|
+---+--------------------+-----------+-----------+-------------+
It was already mentioned, but it’s worth repeating for emphasis. See how how there are no fields from category DataFrame returned in the results above?
Semi Join
A semi join (or any of the following the table above including semi
,leftsemi
,left_semi
) will return values from the left side only which match with the right. It is also referred to as a “left semi join”.
>>> spark.sql("select * from products semi join cat on products.category_id = cat.id order by products.id").show()
+---+--------------------+-----------+-----------+-------------+
| id| name| sku|category_id|current_price|
+---+--------------------+-----------+-----------+-------------+
| 1|Syrup - Golden, L...|41-889-0877| 4| 30.95|
| 3|Pasta - Lasagna N...|08-151-1046| 2| 22.72|
| 4| Fiddlehead - Frozen|15-125-2352| 1| 22.66|
| 5|Juice - Clamato, ...|40-753-5219| 2| 37.72|
| 6|Lamb - Racks, Fre...|52-656-0114| 2| 32.78|
| 8| Oil - Avocado|41-264-0597| 4| 11.71|
| 10|Lotus Rootlets - ...|12-923-5239| 5| 39.76|
| 11| Oats Large Flake|70-628-9900| 2| 12.57|
| 12|Cheese - Brie,danish|33-116-0464| 1| 5.0|
| 13|Bread - Pullman, ...|67-046-6746| 1| 7.52|
| 14|Lettuce - Green Leaf|77-181-3088| 3| 19.16|
+---+--------------------+-----------+-----------+-------------+
If you found these examples helpful, please consider sharing this page on your social networks such as LinkedIn, Reddit, etc. to help spread the word.