Spark Broadcast Variables When and Why

Spark Broadcast Variables

Apache Spark broadcast variables are available to all nodes in the cluster. They are used to cache a value in memory on all nodes, so it can be efficiently accessed by tasks running on those nodes.

For example, broadcast variables are useful with large values needing to be used in each Spark task.

By using a broadcast variable, you can avoid having to send the value to each task over the network, which can improve the performance of your Spark job.

Broadcast variables are read-only and cannot be modified once created.

Instead, if you need to update the value of a broadcast variable, new broadcast variable will need to be created and referenced instead.

The code examples in this tutorial assume you know PySpark Dataframes and how to join in PySpark and join in Spark Scala.

Table of Contents

How to create and use a broadcast variable in PySpark

>>> from pyspark.sql.functions import broadcast
>>> lookup = [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
>>> lookup_col = ["id", "type"]
>>> lookup_table_df = spark.createDataFrame(data=lookup, schema=lookup_col)
# Create and broadcast variable
>>> broadcast(lookup_table_df)
DataFrame[id: bigint, type: string]
# Use the broadcast variable in a Spark DataFrame to join in DataFrame
>>> df.join(lookup_table_df, df.id == lookup_table_df.id, 'inner').show()

This will produce the following result

Spark Broadcast Variable in PySpark Example

How to create and use a broadcast variable in Spark with Scala

scala> val df1 = spark.createDataFrame(Seq((1, "Alice"), (2, "Bob"), (3, "Charlie"))).toDF("id", "name")
df1: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val df2 = sc.broadcast((Seq((1, 25), (2, 30), (3, 35))).toDF("id", "age"))
df2: org.apache.spark.broadcast.Broadcast[org.apache.spark.sql.DataFrame] = Broadcast(0)

scala> df1.join(df2.value, Seq("id"), "inner").show

This will produce the following result:

See also  Apache Spark with Amazon S3 Examples
Spark Broadcast Variable in Scala example

Why use Spark Broadcast Variables?

There are several reasons why you might want to use broadcast variables in Apache Spark:

Improved performance: By using a broadcast variable, you can avoid having to send a large value to each task over the network. This can improve the performance of your Spark job.

For example, as shown in the above code, a lookup table is a good candidate for broadcast variable.

Shared state: Broadcast variables allow you to share state across all nodes in the cluster. This can be useful if you need to maintain a consistent view of a value across all tasks.

Ease of use: As shown above, broadcast variables are easy to use and can be accessed like any other variable in Spark. This can make it simpler to write and maintain Spark jobs that require shared state.

Any Performance Concerns when using Spark Broadcast Variables?

There are a few potential concerns to keep in mind when using broadcast variables in Apache Spark:

Memory usage: Broadcast variables are stored in memory on each node in the cluster. Make sure you have enough memory available to store the broadcast variable.

Serialization: Broadcast variables are serialized and sent to each node in the cluster, so an efficient serialization format is essential. You can specify a different serialization library (such as Kryo) to improve performance.

Network traffic: If you are using a large broadcast variable, it can take time and network bandwidth to distribute it to all Spark nodes.

Read-only: As previously mentioned, broadcast variables are read-only, so you cannot modify them once they have been created. If you need to update the value of a broadcast variable, create a new broadcast variable and use that instead.

See also  Spark Thrift Server with Cassandra Example

This can be inconvenient if you need to update the value of a broadcast variable frequently.

Further Resources

Broadcast variables are included in a dedicated section in the Spark Tuning Guide.

Before you go

There are many Spark Scala tutorials and PySpark SQL tutorials on this site which you find interesting.

Leave a Reply

Your email address will not be published. Required fields are marked *