
This tutorial covers using Spark SQL with a JSON file input data source in Scala. If you are interested in using Python instead, check out the Spark SQL JSON in Python tutorial page.
Spark SQL JSON Overview
We will show examples of JSON as an input source to Spark SQL’s SQLContext. This Spark SQL tutorial with JSON has two parts. Part 1 focuses on the “happy path” of using JSON with Spark SQL. Part 2 covers a “gotcha,” or something you might not expect, when using the Spark SQL JSON data source.
By the way, if you are not familiar with Spark SQL, a couple of references include the Spark SQL chapter summary post and the first Spark SQL CSV tutorial.
Methodology
We build upon the baby_names.csv file from the previous tutorial, as well as a simple file to get us started, which I’ve called customers.json. Here is a gist of customers.json.
Spark SQL JSON Example Tutorial Part 1
1. Start the spark shell
$SPARK_HOME/bin/spark-shell
2. Load the JSON using the jsonFile function from the provided sqlContext. The following assumes you have customers.json in the same directory from which the spark-shell script was called.
RBH12103:spark-1.4.1-bin-hadoop2.4 tmcgrath$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val customers = sqlContext.jsonFile("customers.json")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
customers: org.apache.spark.sql.DataFrame = [address: struct<city:string,state:string,street:string,zip:string>, first_name: string, last_name: string]
3. Register the data as a temp table to ease our future SQL queries
scala> customers.registerTempTable("customers")
4. We are now in a position to run some Spark SQL
scala> val firstCityState = sqlContext.sql("SELECT first_name, address.city, address.state FROM customers")
firstCityState: org.apache.spark.sql.DataFrame = [first_name: string, city: string, state: string]
scala> firstCityState.collect.foreach(println)
[James,New Orleans,LA]
[Josephine,Brighton,MI]
[Art,Bridgeport,NJ]
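By the way, the same projection can be expressed through the DataFrame API instead of SQL. Here is a quick sketch (not part of the original session; nested struct fields use the same dot notation):

// DataFrame API equivalent of the SQL query above (sketch)
scala> val firstCityStateDF = customers.select("first_name", "address.city", "address.state")
scala> firstCityStateDF.collect.foreach(println)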
Ok, we started with a simple example, but the real world is rarely this simple. So, in part 2, we’ll cover a more complex example.
Spark SQL JSON Example Tutorial Part 2
If you run the customers.json from Part 1 through http://jsonlint.com, it will not validate. You might be surprised to know that creating invalid JSON for Part 1 was intentional. Why? We needed a JSON source that works well with Spark SQL out of the box.
Note that what Spark expects as a “JSON file” is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
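To make “one JSON object per line” concrete, here is a small sketch. The records are built in memory with illustrative values rather than taken from the customers.json gist:

// Sketch: each string below stands in for one line of a line-delimited JSON file (values are made up)
val lines = sc.parallelize(Seq(
  """{"first_name":"James","address":{"city":"New Orleans","state":"LA"}}""",
  """{"first_name":"Josephine","address":{"city":"Brighton","state":"MI"}}"""))
val lineDelimited = sqlContext.read.json(lines)  // each element parses as its own record
lineDelimited.printSchema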
But, what happens if we have valid JSON?
In this part of the Spark SQL JSON tutorial, we’ll cover how to use valid JSON as an input source for Spark SQL.
As input, we’re going to convert the baby_names.csv file to baby_names.json. There are many CSV to JSON conversion tools available… just search for “CSV to JSON converter”.
I converted and reduced the baby_names.csv to the following:
[{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "QUEENS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "MOSHE",
"County": "KINGS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "ETHAN",
"County": "QUEENS",
"Sex": "M",
"Count": "216"
}]
I saved this as a file called baby_names.json.
Steps
1. Start the spark-shell from the same directory containing the baby_names.json file
2. Load the JSON using the SparkContext wholeTextFiles method, which produces a PairRDD. Use map to create a new RDD from the value portion of each pair.
scala> val jsonRDD = sc.wholeTextFiles("baby_names.json").map(x => x._2)
jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at map at <console>:21
3. Read in this RDD as JSON and confirm the schema
scala> val namesJson = sqlContext.read.json(jsonRDD)
namesJson: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]
scala> namesJson.printSchema
root
|-- Count: string (nullable = true)
|-- County: string (nullable = true)
|-- First Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Year: string (nullable = true)
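From here, the DataFrame can be registered and queried just as in Part 1. A sketch of what that might look like (not from the original session; note the backticks, since First Name contains a space and Count collides with a SQL keyword):

scala> namesJson.registerTempTable("names")
scala> val jaydenCounts = sqlContext.sql("SELECT `First Name`, County, `Count` FROM names WHERE `First Name` = 'JAYDEN'")
scala> jaydenCounts.collect.foreach(println)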
Spark SQL Further Reference
See Spark SQL and Spark SQL with Scala and PySpark SQL tutorials.
Featured image credit https://flic.kr/p/9pJKgA
This is a nice read. If you have some free time, can you guide me in solving the scenario below?
How do I process a file when it has multiple JSON records in it? For example:
{
name : "Test Name"
age : "10"
}
{
name : "Test Name2"
age : "50"
}
Is there a way to process files like these? Or do I have to write a custom InputFormat like in Hadoop for this?
SQLContext's read.json will read this type of data too, as long as each JSON object sits on a single line; just pass the path to the json method and do your processing normally. If the records span multiple lines, the wholeTextFiles approach from Part 2 applies.
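For example, a short sketch in which the people.json file name and contents are hypothetical: if each record is a complete JSON object on its own line, read.json loads it directly and no custom InputFormat is needed.

// Hypothetical people.json, one complete JSON object per line:
//   {"name": "Test Name", "age": "10"}
//   {"name": "Test Name2", "age": "50"}
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people").collect.foreach(println)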
It was helpful for me. How do I access array elements in a JSON schema? I would like to read the promotions elements. Please suggest.
|-- promotions: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- promotionContent: struct (nullable = true)
|    |    |    |-- ID: string (nullable = true)
I had the same problem.
In Spark you can use:
scala> val selectView = sqlContext.sql("SELECT inline(promotions) FROM table_name")
If you want to save the result in Hive, execute:
scala> selectView.saveAsTable("database.final_table")
I hope this helps.
What if I have a JSON file which looks like this:
{
"Name": "Hafeez",
"Type": "Normal",
"Lang": "French",
"tags": [{
"1": "Kid",
"2": "Path",
"3": "LCD"
}, {
"1": "Brother",
"2": "Kole",
"3": "screen",
"4": 6
}]
}
and I need the schema to look like this:
scala> namesJson.printSchema
root
|-- Name: string (nullable = true)
|-- Type: string (nullable = true)
|-- Lang: string (nullable = true)
|-- 1: string (nullable = true)
|-- 2: string (nullable = true)
|-- 3: string (nullable = true)
|-- 4: integer (nullable = true)
How can we fetch a column as address.city (parent_column.child_column) in Spark SQL?
What if I need to update/mask a field in a JSON file, maybe the name or address fields?
I have tons of JSON files and the fields can be nested to any level. Can Spark be helpful in updating the PII fields? Any idea would be helpful.