With Spark read JSON, users can easily load JSON data into Spark DataFrames, which can then be manipulated using Spark’s powerful APIs. This allows users to perform complex data transformations and analysis on JSON data, without the need for complex parsing and data manipulation code. In addition, Spark Read JSON supports a wide range of JSON data formats, including nested and complex structures, making it a versatile tool for working with JSON data.
This Spark read json tutorial covers using Spark SQL with a JSON file input data source in Scala.
** The original tutorial was created for Spark 1.4, but it has been updated for Spark 3.4 as May 2023 **
If you are interested in using Python instead, check out Spark SQL JSON in Python tutorial page.
Table of Contents
- Spark SQL JSON Overview
- Spark Read JSON Example Tutorial Part 1
- Spark SQL JSON Example Tutorial Part 2
- Why use Spark to read JSON?
- Spark Read JSON and Query with SQL Further External Reference
Spark SQL JSON Overview
We will show examples of JSON as input source to Spark SQL’s SQLContext. This Spark SQL tutorial with JSON has two parts. Part 1 focus is the “happy path” when using JSON with Spark SQL. Part 2 covers a “gotcha” or something you might not expect when using Spark SQL JSON data source.
By the way, If you are not familiar with Spark SQL, a couple of references include a summary of Spark SQL chapter post and the first Spark read CSV in Scala tutorial.
We will use a customers.json file as a simple file to get us started which I’ve called customers.json. Here is a link to customers.json file. In part 2, we use a more advanced example JSON file, but this Spark read json tutorial is designed to start simple intentionally.
Spark Read JSON Example Tutorial Part 1
1. Start the spark shell
2. Spark Read JSON Example
Next, in the spark shell which conveniently provides a SparkSession, we can use the
spark.read.json function to load the example JSON file into a new DataFrame as shown:
scala> val customers = spark.read.json("customers.json") customers: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string ... 2 more fields>, first_name: string ... 1 more field]
The above code sample assumes you have customers.json in the same directory as from where the spark-shell script was called.
3. JSON DataFrame View
Next let’s register the JSON based DataFrame to a temp table to ease our future SQL queries as shown next:
4. Query JSON with SQL
We are now in a position to run some Spark SQL
scala> spark.sql("SELECT first_name, address.city, address.state FROM customers").show +----------+-----------+-----+ |first_name| city|state| +----------+-----------+-----+ | James|New Orleans| LA| | Josephine| Brighton| MI| | Art| Bridgeport| NJ| +----------+-----------+-----+
Ok, we started with a simple example, but the real world is rarely this simple. So, in part 2, we’ll cover a more complex example.
Spark SQL JSON Example Tutorial Part 2
If you run the customers.json from part 1 above through http://jsonlint.com, it will not validate.
You may be surprised to know that creating invalid JSON for Part 1 was intentional.
Two part answer.
Part 1 – we needed JSON source which works well to Spark read JSON and then query with Spark SQL quickly to have success and build momentum.
But, if you read the Spark SQL documentation closely:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.Source: Apache Spark documentation
So, this leads to the question what happens if we have valid JSON?
In this part of the Spark read JSON tutorial, we’ll cover how to use valid JSON as an input source for reading from Spark and then query with Spark SQL.
For input, we’ll use a file called baby_names.json.
It looks like this:
"First Name": "DAVID",
"First Name": "JAYDEN",
"First Name": "JAYDEN",
"First Name": "MOSHE",
"First Name": "ETHAN",
I saved this as a file called baby_names.json and is available from spark-shell.
It is available for download at: https://github.com/supergloo/spark-scala-examples/tree/main/spark-read-json
Spark Read JSON Part 2 Steps
1. Start the spark-shell
This example assumes the baby_names.json file is in the same directory as where spark-shell was started.
2. Spark Read JSON [Failure]
Let’s show an example which I know will fail for reasons previously described:
scala> val names = spark.read.json("baby_names.json") names: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
I consider this a failure because of the _corrupt_record: string result. Let’s fix this next.
3. Spark Read JSON _corrupt_record How to Fix
To fix the previously failure of _corrupt_record in Spark read JSON attempt, we need to set the “multiLine” option as shown here
scala> val n = spark.read.option("multiLine", true).json("baby_names.json") n: org.apache.spark.sql.DataFrame = [Count: string, County: string ... 3 more fields] // now we can register a temp table and query with SQL scala> n.createOrReplaceTempView("babies") scala> spark.sql("SELECT * FROM babies").show +-----+------+----------+---+----+ |Count|County|First Name|Sex|Year| +-----+------+----------+---+----+ | 272| KINGS| DAVID| M|2013| | 268| KINGS| JAYDEN| M|2013| | 219|QUEENS| JAYDEN| M|2013| | 219| KINGS| MOSHE| M|2013| | 216|QUEENS| ETHAN| M|2013| +-----+------+----------+---+----+
So, there you go. We can solve _corrupt_record by using the “mutliLine” option set to
true to fix.
By the way, this was much more difficult in Spark 1.
I use to have to send through a
map function after loading the file with
wholeTextFiles function with SparkContext. The wholeTextFiles method produced a PairRDD, and then map to create the new RDD using the value portion of the pair.
scala> val jsonRDD = sc.wholeTextFiles("baby_names.json").map(x => x._2) jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at map at <console>:21
Then, we could use this RDD as JSON and confirm the schema
scala> val namesJson = sqlContext.read.json(jsonRDD) namesJson: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string] scala> namesJson.printSchema root |-- Count: string (nullable = true) |-- County: string (nullable = true) |-- First Name: string (nullable = true) |-- Sex: string (nullable = true) |-- Year: string (nullable = true) scala>
But, luckily in Spark 2 and beyond, the “multiLine” option can make things much more simple.
Why use Spark to read JSON?
Here are some reasons why using Spark to read JSON can be beneficial:
- Efficient: Spark’s distributed computing architecture allows it to process large JSON datasets in a highly parallelized manner
- Flexibility: JSON data can have a complex and flexible schema as we saw above, which can be challenging at times. Spark’s ability to handle semi-structured and nested data can make working with JSON more flexible.
- Integration: Spark can integrate with other data sources, such as Hadoop Distributed File System (HDFS), Apache Cassandra, and Amazon S3. This makes it easy to read JSON data from these sources.
- Rich set of APIs: Spark provides a rich set of APIs for working with JSON data that evolved significantly over time to perform complex load and queries on JSON datasets.
- Scalable: Spark is designed to scale horizontally.
Overall, using Spark to read JSON data can provide a solution for processing large and complex JSON based datasets. .
Spark Read JSON and Query with SQL Further External Reference
- The Apache Spark documentation on using JSON https://spark.apache.org/docs/latest/sql-data-sources-json.html
- Source for JSON files used in this tutorial: https://github.com/supergloo/spark-scala-examples/tree/main/spark-read-json