PySpark SQL functions are available for use in the SQL context of a PySpark application. These functions allow us to perform various data manipulation and analysis tasks such as filtering and aggregating data, performing inner and outer joins, and conducting basic data transformations in PySpark. PySpark functions and PySpark SQL functions are not the same […]
PySpark SQL Tutorials
PySpark SQL is a module of PySpark, the Python interface for Apache Spark. PySpark supports most of Spark’s core features and extensions, such as Spark SQL, DataFrames, Streaming, and MLlib (machine learning).
Unlike the basic Spark RDD API, the interfaces provided by PySpark SQL give Spark more information about both the structure of the data and the computation being performed. There are different ways to interact with Spark SQL, including SQL and the Dataset API.
Python does not support the Dataset API, but many of the benefits found within the Spark Dataset API are already available from Python. For example, Python developers can already access a field in a row by name, e.g. row.columnName.
Because these interfaces provide more insight into the structure of the data and the computation being performed, Spark can apply certain performance optimizations.
PySpark SQL tutorials are available below, but if you are a Python programmer coming to PySpark SQL from Pandas or NumPy, then you should familiarize yourself with Apache Arrow for performance reasons when converting a Spark DataFrame to a Pandas DataFrame and vice versa.
How is PySpark SQL implemented in Spark?
PySpark SQL requires transferring data between the JVM and Python processes. In recent versions of Apache Spark, this transfer is implemented with Apache Arrow.
Currently, this is most beneficial to Python users working with Pandas/NumPy data. Apache Arrow usage is not automatic and is disabled by default.
Installation and Usage
To use Apache Arrow in PySpark, PyArrow must be installed. If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the SQL module with the command `pip install pyspark[sql]`.
To use Arrow when executing conversions between Spark DataFrame and Pandas DataFrame and vice versa, set the Spark configuration `spark.sql.execution.arrow.pyspark.enabled` to `true`. This is disabled by default.
And now on to the PySpark SQL tutorials.
PySpark withColumn by Example
The PySpark withColumn function is used to add a new column to a PySpark DataFrame or to replace the values in an existing column. To execute the PySpark withColumn function you must supply two arguments. The first argument is the name of the new or existing column. The second argument is the desired value to […]
PySpark UDF by Example
A PySpark UDF, or PySpark User Defined Function, is a powerful and flexible tool in PySpark. UDFs allow users to define their own custom functions and then use them in PySpark operations. PySpark UDFs can provide a level of flexibility, customization, and control not possible with built-in PySpark SQL API functions. They can allow developers […]
PySpark Filter by Example
In PySpark, the DataFrame filter function selects the rows that satisfy a specified condition. For example, with a DataFrame containing website click data, we may wish to keep only the rows whose platform column matches a certain browser type. This would allow us to isolate the website requests made by that browser. Solutions like this may […]
How to PySpark GroupBy through Examples
In PySpark, the DataFrame groupBy function groups data together based on specified columns, so aggregations can be run on the collected groups. For example, with a DataFrame containing website click data, we may wish to group together all the browser type values contained in a certain column, and then determine an overall count by each browser […]
PySpark Joins with SQL
Use PySpark joins with SQL to compare, and possibly combine, data from two or more data sources based on matching field values. This is often simply called “joins”, and usually the data sources are database tables or flat files, though increasingly they are Kafka topics. Regardless […]
PySpark Join Examples with DataFrame join function
PySpark joins are used to combine data from two or more DataFrames based on a common field between them. There are many different types of joins. The specific join type used is usually chosen based on the business use case as well as on performance considerations. Joins can be an expensive operation in distributed systems […]
PySpark MySQL Python Example with JDBC
Let’s cover how to use Spark SQL with Python and a MySQL database input data source. Shall we? Yes, yes we shall. Consider this tutorial an introductory step when learning how to use Spark SQL with a relational database and Python. If you are brand new, check out the Spark with Python Tutorial. PySpark MySQL […]
PySpark SQL JSON Examples in Python
This short PySpark SQL tutorial shows analysis of World Cup player data using PySpark SQL with a JSON file input data source, from a Python perspective. PySpark SQL with JSON Overview We are going to load a JSON input source to Spark SQL’s SQLContext. This Spark SQL JSON with Python tutorial has two parts. The first […]
PySpark Reading CSV with SQL Examples
In this PySpark reading CSV tutorial, we will use Spark SQL with a CSV input data source using the Python API. We will continue to use the Uber CSV source file as used in the Getting Started with Spark and Python tutorial presented earlier. Also, this Spark SQL CSV tutorial assumes you are familiar with using […]