PySpark UDFs Demystified: Learn with Step-by-Step Examples


A PySpark UDF, or PySpark User Defined Function, is a powerful and flexible tool in PySpark. UDFs allow users to define their own custom functions and then use them in PySpark operations. PySpark UDFs can provide a level of flexibility, customization, and control not possible with built-in PySpark SQL API functions, and they let developers build custom APIs tailored to their business-specific workloads.

Overview of PySpark UDFs

In this PySpark UDF tutorial, let’s define a PySpark UDF and then work through examples of its use.

Consider the following example where you have a DataFrame with a column containing text.  You want to replace the word “hello” with “bonjour” in every row of the DataFrame.  How would you do that?

In plain Python, the str.replace() method accomplishes this easily with a small dataset.
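For example, a minimal pure-Python version might look like this (the sample strings here are purely illustrative):

# Plain Python: fine for a handful of strings, but single-threaded
rows = ["say hello", "hello world"]
rows = [text.replace("hello", "bonjour") for text in rows]
print(rows)  # ['say bonjour', 'bonjour world']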

But what if your dataset has millions of rows? Calling text.replace() row by row in plain Python would be inefficient and time-consuming, and it would not scale to larger datasets.

This is where PySpark UDFs come in.

PySpark UDFs provide the mechanism for defining a custom function to perform the text replacement. In PySpark, you can leverage the distributed computing power of a cluster to apply your custom function to the data in parallel. Processing large datasets quickly and efficiently, without waiting for each row to be processed one at a time, is now possible.

To define a PySpark UDF, wrap a Python function with udf() and specify the return data type. For example, here is a UDF which replaces the word “hello” with “bonjour” in a string:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# define the udf
replace_hello_with_bonjour = udf(lambda x: x.replace("hello", "bonjour"), StringType())
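One practical caveat: a Python UDF receives None for null values, so the lambda above would raise an AttributeError on a null row. A slightly more defensive sketch:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Guard against nulls before calling .replace()
replace_hello_with_bonjour = udf(
    lambda x: x.replace("hello", "bonjour") if x is not None else None,
    StringType(),
)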

PySpark UDF Example 1

Use the UDF in any PySpark operation just like any other built-in function. To apply the UDF to the DataFrame from our example above, you could use the following code:

# Apply the UDF to the dataframe
df = df.withColumn("text", replace_hello_with_bonjour(df.text))

Using a PySpark UDF allows you to easily and efficiently process large datasets in parallel, without having to worry about the scalability of your operations.

PySpark UDF Example 2

Let’s consider a slightly more complex example of using a custom PySpark UDF to process a large dataset.

Suppose we have a dataset containing information about different types of fruit, including:

  • Type of fruit
  • Color of the fruit
  • Weight of the fruit

Imagine a dataset stored in a CSV file.  We want to convert the fruit type and color columns into a single combined column.

First, let’s assume we use PySpark’s built-in CSV reader to load the data from the CSV file into a DataFrame:

# Load the data from the CSV file
df = spark.read.csv("fruit_data.csv", header=True)

Next, we define a PySpark UDF to combine the fruit type and color columns into a single column. We define the UDF to take two string arguments, fruit type and color.  The result of the function is a single string with the combined fruit type and color.

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# define the udf
combine_fruit_type_and_color = udf(lambda fruit_type, color: f"{fruit_type} ({color})", StringType())

As expected, we can use this new UDF to apply a PySpark transformation to the DataFrame. We use the withColumn() method to create a new column in the DataFrame containing the combined fruit type and color from our UDF.

# Apply the UDF to the dataframe
df = df.withColumn("fruit", combine_fruit_type_and_color(df.type, df.color)

Finally, we use PySpark’s built-in CSV writer to save the transformed DataFrame back to a new CSV file (note that Spark writes a directory of part files rather than a single file):

# Save the transformed dataframe to a CSV file
df.write.csv("fruit_data_transformed.csv", header=True)

By defining your own custom PySpark UDFs, you gain greater control over the data processing pipeline and can solve problems specific to your business use cases in data analysis and machine learning.

PySpark UDF Example 3 Setup

Now, let’s dive deeper and consider a more complex example of a PySpark UDF. In this example, we will use a PySpark UDF to perform sentiment analysis on a large dataset of movie reviews. We will use NLTK’s pre-trained sentiment model to predict the sentiment of each movie review (positive or negative) based on the text of the review.

To begin, we need to load the movie review dataset into a PySpark DataFrame. Again, we can use PySpark’s built-in CSV reader to load the data from a CSV file:

# Load the data from the CSV file
df = spark.read.csv("movie_reviews.csv", header=True)

Next, we need to define our PySpark UDF to perform sentiment analysis on the movie review text. We will define the UDF to take a string argument (the text of the movie review) and return a numeric value representing the sentiment of the review (1 for positive, 0 for negative).

In the UDF, we will use the NLTK library to perform natural language processing on the movie review text.

NLTK, or Natural Language Toolkit, is a popular Python library for building natural language processing (NLP) systems.  It provides a number of useful tools for performing sentiment analysis, including a pre-trained sentiment analysis model.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Define the UDF
nltk.download("vader_lexicon")
sentiment_analyzer = SentimentIntensityAnalyzer()
analyze_sentiment = udf(
    lambda text: int(sentiment_analyzer.polarity_scores(text)["pos"] > 0.5),
    IntegerType(),
)
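Before wrapping the analyzer in a UDF, you can sanity-check it locally on the driver; polarity_scores() returns a dict with neg, neu, pos, and compound keys (the example sentence is just illustrative):

# Quick local check of the analyzer (driver side)
scores = sentiment_analyzer.polarity_scores("This movie was fantastic!")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores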

PySpark UDF Example 3 Code Analysis

The code imports IntegerType and udf from the pyspark.sql.types and pyspark.sql.functions modules, respectively. IntegerType is a class representing the integer data type in PySpark SQL, and udf should be familiar by now.

The code downloads a sentiment analysis lexicon called “vader_lexicon” using the nltk.download() function. This lexicon is a list of words and their associated sentiment scores. It is used to determine the overall sentiment of a piece of text.

Next, the code creates a SentimentIntensityAnalyzer object from NLTK’s nltk.sentiment module. This object analyzes a piece of text and calculates its sentiment scores.

Finally, the code defines a new UDF called analyze_sentiment using the udf() function.

As we can now see, the UDF takes a single text argument and returns 1 if the overall sentiment of the text is positive (here, when the analyzer’s pos score exceeds 0.5) and 0 otherwise. The UDF is declared to return integer values and is implemented using the SentimentIntensityAnalyzer object created earlier.

We can now use the UDF to determine the sentiment of the movie review text in our DataFrame. Similar to the previous example, the withColumn() method is used to create a new column in the DataFrame containing the sentiment of each movie review, as shown in the sketch below.
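A minimal sketch of that step, assuming the review text lives in a column named review (adjust to your dataset’s actual column name):

# Add a sentiment column: 1 = positive, 0 = negative
df = df.withColumn("sentiment", analyze_sentiment(df.review))
df.show(truncate=False)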

PySpark UDF Conclusion

We hope you enjoyed this tutorial on PySpark UDFs.  If you have any questions or suggestions for improvement, please use the comments below.

Further Resources

You may be interested in exploring more PySpark Tutorials.

See also: Deep dive into PySpark SQL Functions.
About Todd M

Todd has held multiple software roles over his 20-year career. For the last 5 years, he has focused on helping organizations move from batch processing to data streaming. In addition to the free tutorials, he provides consulting and coaching for Data Engineers, Data Scientists, and Data Architects. Feel free to reach out directly or to connect on LinkedIn.
