
In this PySpark read CSV tutorial, we will use Spark SQL with a CSV input data source using the Python API. We will continue to use the Uber CSV source file used in the Getting Started with Spark and Python tutorial presented earlier.
This Spark SQL CSV tutorial also assumes you are familiar with using SQL against relational databases, either directly or from Python. In other words, you have experience with SQL and would like to know how to use it with Spark.
PySpark Overview
Spark SQL uses a type of Resilient Distributed Dataset called DataFrames. DataFrames are composed of Row objects accompanied by a schema which describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database. As you might expect, DataFrames may be created from a variety of input sources including CSV text files.
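To make the Row-and-schema idea concrete, here is a minimal sketch, assuming a pyspark shell where a SparkSession named spark is already available (as in the Spark 2.0+ setup below); the column names and values are invented purely for illustration.
>>> from pyspark.sql import Row
>>> # Two hypothetical rows; Spark infers the schema from the Python types
>>> manual_df = spark.createDataFrame([Row(base="B02512", trips=100),
...                                    Row(base="B02598", trips=250)])
>>> manual_df.printSchema()
root
 |-- base: string (nullable = true)
 |-- trips: long (nullable = true)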
This intro to PySpark SQL post will use a CSV file from previous Spark Python tutorials found here:
Methodology Depends on Spark Version
Our methodology will depend on the version of Spark being used.
If you are using an older version of Spark (before Spark 2.0), the spark-csv package available from Spark Packages was released to make your life easier, but it is not a requirement for reading CSV in PySpark. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames.” This library is compatible with Spark 1.3 and above.
However, if you are using Spark 2.0 or above, the spark-csv functionality has been built into Spark itself, as described in the spark-csv GitHub repo.
Table of Contents
- PySpark Overview
- Methodology Depends on Spark Version
- PySpark SQL CSV Examples Setup
- PySpark CSV with SQL Examples
- PySpark SQL with CSV More Advanced
- PySpark SQL Resource
- Further Spark Python References
PySpark SQL CSV Examples Setup
To make things easier, we are going to divide the PySpark SQL CSV examples into two versions: one for Spark versions earlier than 2.0 and one for Spark 2.0 and above.
PySpark SQL CSV Example Setup < Spark 2.0
1. Depending on your version of Scala, start the pyspark shell with a --packages command line argument.
At the time of this writing, Scala 2.10 version:
$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0
At the time of this writing, Scala 2.11 version:
$SPARK_HOME/bin/pyspark --packages com.databricks:spark-csv_2.11:1.3.0
2. Using the sqlContext available from the shell, load the CSV with the read, format, options, and load functions
>>> df = sqlContext.read.format('com.databricks.spark.csv') \
...     .options(header='true', inferschema='true') \
...     .load('Uber-Jan-Feb-FOIL.csv')
In the above code, we specify that we want to use the com.databricks.spark.csv format from the package we passed to the shell in step 1. Setting “header” to true signifies that the first row contains column names. “inferSchema” instructs Spark to attempt to infer the schema of the CSV, and finally the load function passes in the path and name of the CSV source file. In this example, the Uber-Jan-Feb-FOIL.csv file is in the same directory from which pyspark was launched.
Note: you may be interested in comparing this example to the Spark 2.0+ version below. There is now a csv function which makes things more convenient.
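If you would rather not rely on schema inference with the spark-csv package, you can supply an explicit schema instead. The sketch below is a hedged alternative to the load above; the column names mirror the Uber file, and the types shown are an assumption based on the schema printed later in this post.
>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> # Assumed column types for the Uber FOIL file
>>> uber_schema = StructType([
...     StructField("dispatching_base_number", StringType(), True),
...     StructField("date", StringType(), True),
...     StructField("active_vehicles", IntegerType(), True),
...     StructField("trips", IntegerType(), True)])
>>> df = sqlContext.read.format('com.databricks.spark.csv') \
...     .options(header='true') \
...     .schema(uber_schema) \
...     .load('Uber-Jan-Feb-FOIL.csv')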
3. Register a temp table
>>> df.registerTempTable("uber")
Similar to how the spark-csv package requirement has changed over time, the registerTempTable function has also changed. It is deprecated in Spark 2.0 and above in favor of createOrReplaceTempView, as shown in the next setup example.
PySpark SQL CSV Example Setup >= Spark 2.0
1. Start PySpark
~/dev/tmp $pyspark
Python 3.9.12 (main, Mar 26 2022, 15:51:15)
[Clang 13.1.6 (clang-1316.0.21.2)] on darwin
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
Using Python version 3.9.12 (main, Mar 26 2022 15:51:15)
Spark context Web UI available at http://192.168.1.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1669304534671).
SparkSession available as 'spark'.
>>>
This is just to show the Python and Spark version used in this setup.
2. Next, using the SparkSession available from the shell as spark, load the CSV with the read, options, and csv functions
>>> df = spark.read.options(header='true', inferschema='true').csv('Uber-Jan-Feb-FOIL.csv')
In the options function code above, setting the header option to true signifies that the first row contains column names. Also, setting inferSchema to true instructs Spark to attempt to infer the schema of the CSV. Last, we use the csv function to pass in the path and name of the CSV source file. In this example, the Uber-Jan-Feb-FOIL.csv file is in the same directory from which pyspark was launched.
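The same read can also be expressed with individual option calls, which some find easier to scan; this is just an equivalent sketch of the one-liner above, not an additional step.
>>> df = spark.read \
...     .option("header", "true") \
...     .option("inferSchema", "true") \
...     .csv("Uber-Jan-Feb-FOIL.csv")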
3. Register a temp table
>>> df.createOrReplaceTempView("uber")
The createOrReplaceTempView function either creates or replaces a local, temporary view with the provided DataFrame. It is used instead of registerTempTable in newer versions of Spark. In either case, a temporary view is similar in concept to a SQL table, where each table contains rows and columns.
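As a quick sanity check, a couple of Spark 2.0+ calls can confirm the view was registered; this is an optional sketch rather than a required step.
>>> spark.catalog.listTables()      # the 'uber' temporary view should appear in this list
>>> spark.table("uber").count()     # the view can also be read back as a DataFrame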
PySpark CSV with SQL Examples
We’re now ready to query using SQL, such as finding the distinct NYC Uber bases in the CSV. (If you followed the Spark 2.0+ setup, use spark.sql in place of sqlContext.sql in the examples below.)
>>> distinct_bases = sqlContext.sql("select distinct dispatching_base_number from uber")
>>> for b in distinct_bases.collect(): print(b)
Row(dispatching_base_number=u'B02598')
Row(dispatching_base_number=u'B02764')
Row(dispatching_base_number=u'B02765')
Row(dispatching_base_number=u'B02617')
Row(dispatching_base_number=u'B02682')
Row(dispatching_base_number=u'B02512')
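For comparison, the same distinct list can be produced without SQL by calling the DataFrame API directly on df; this is an equivalent sketch, not an extra step.
>>> df.select("dispatching_base_number").distinct().show()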
It might also be handy to know the schema
>>> df.printSchema()
root
|-- dispatching_base_number: string (nullable = true)
|-- date: string (nullable = true)
|-- active_vehicles: integer (nullable = true)
|-- trips: integer (nullable = true)
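Notice that the date column was inferred as a plain string. If you need real date semantics, for example to sort chronologically, one hedged option in Spark 2.2 and above is the to_date function with a format pattern; the 'M/d/yyyy' pattern is an assumption based on the dates shown in the results below.
>>> from pyspark.sql.functions import to_date
>>> # Assumes dates look like 1/31/2015; adjust the pattern if your file differs
>>> df_dated = df.withColumn("date", to_date(df["date"], "M/d/yyyy"))
>>> df_dated.printSchema()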
PySpark SQL with CSV More Advanced
Let’s try some more advanced SQL, such as determining which Uber base is the busiest based on the number of trips
>>> sqlContext.sql("""select distinct(`dispatching_base_number`),
sum(`trips`) as cnt from uber
group by `dispatching_base_number`
order by cnt desc""").show()
+-----------------------+-------+
|dispatching_base_number| cnt|
+-----------------------+-------+
| B02764|1914449|
| B02617| 725025|
| B02682| 662509|
| B02598| 540791|
| B02765| 193670|
| B02512| 93786|
+-----------------------+-------+
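For comparison, here is a sketch of the same aggregation using the DataFrame API instead of SQL; sum is imported under another name to avoid shadowing Python's built-in.
>>> from pyspark.sql.functions import sum as sum_
>>> df.groupBy("dispatching_base_number") \
...     .agg(sum_("trips").alias("cnt")) \
...     .orderBy("cnt", ascending=False) \
...     .show()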
Or the 5 busiest days based on the number of trips in the time range of the data:
>>> sqlContext.sql("""select distinct(date),sum(trips) as cnt
from uber group by date
order by cnt desc limit 5""").show()
+---------+------+
| date| cnt|
+---------+------+
|2/20/2015|100915|
|2/14/2015|100345|
|2/21/2015| 98380|
|2/13/2015| 98024|
|1/31/2015| 92257|
+---------+------+
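The busiest-days query has an equally direct DataFrame API counterpart; again, this is an equivalent sketch rather than a new step.
>>> from pyspark.sql.functions import sum as sum_
>>> df.groupBy("date").agg(sum_("trips").alias("cnt")) \
...     .orderBy("cnt", ascending=False).limit(5).show()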
PySpark SQL Resource
An older version of this tutorial is available as an IPython notebook: https://github.com/tmcgrath/spark-with-python-course/blob/master/Spark-SQL-CSV-with-Python.ipynb
Further Spark Python References
Featured image credit https://flic.kr/p/4FbcGr