Spark SQL JSON Examples

This tutorial covers using Spark SQL with a JSON file input data source.

Overview

We will show examples of using JSON as an input source for Spark SQL's SQLContext.  This Spark SQL tutorial with JSON has two parts.  Part 1 covers the "happy path" of using JSON with Spark SQL.  Part 2 covers a "gotcha," or something you might not expect, when using the Spark SQL JSON data source.

By the way, if you are not familiar with Spark SQL, a couple of helpful references are the Spark SQL chapter summary post and the first Spark SQL CSV tutorial.

Methodology

We build upon the baby_names.csv file from the previous tutorial, as well as a simple file to get us started, which I've called customers.json.  Here is a gist of customers.json.
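If you don't have the gist handy, a stand-in with the same shape works for following along. The field names match the schema Spark infers below, and the first names, cities, and states match the query output in step 4; the last names, streets, and zip codes here are invented placeholders. Note that each record sits on its own line rather than inside a pretty-printed array; Part 2 explains why that matters:

```json
{"first_name": "James", "last_name": "Doe", "address": {"street": "123 Main St", "city": "New Orleans", "state": "LA", "zip": "70112"}}
{"first_name": "Josephine", "last_name": "Doe", "address": {"street": "456 Oak Ave", "city": "Brighton", "state": "MI", "zip": "48114"}}
{"first_name": "Art", "last_name": "Doe", "address": {"street": "789 Pine Rd", "city": "Bridgeport", "state": "NJ", "zip": "08014"}}
```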

Spark SQL JSON Example Tutorial Part 1

1. Start the spark shell

$SPARK_HOME/bin/spark-shell

2. Load the JSON using the jsonFile function from the provided sqlContext.  The following assumes you have customers.json in the same directory from which the spark-shell script was called.

RBH12103:spark-1.4.1-bin-hadoop2.4 tmcgrath$ bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val customers = sqlContext.jsonFile("customers.json")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
customers: org.apache.spark.sql.DataFrame = [address: struct<city:string,state:string,street:string,zip:string>, first_name: string, last_name: string]

3. Register the data as a temp table to simplify our future SQL queries

scala> customers.registerTempTable("customers")

4. We are now in a position to run some SQL

scala> val firstCityState = sqlContext.sql("SELECT first_name, address.city, address.state FROM customers")
firstCityState: org.apache.spark.sql.DataFrame = [first_name: string, city: string, state: string]

scala> firstCityState.collect.foreach(println)
[James,New Orleans,LA]
[Josephine,Brighton,MI]
[Art,Bridgeport,NJ]

OK, we started with a simple example, but the real world is rarely this simple.  So, in Part 2, we'll cover a more complex example.

Spark SQL JSON Example Tutorial Part 2

If you run the customers.json file from Part 1 through http://jsonlint.com, it will not validate.  You might be surprised to learn that creating invalid JSON for Part 1 was intentional.  Why?  We needed a JSON source which works well with Spark SQL out of the box.

If you read the Spark SQL documentation closely:

“Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.”

But, what happens if we have valid JSON?

In this part of the Spark SQL JSON tutorial, we’ll cover how to use valid JSON as an input source for Spark SQL.

As input, we're going to convert the baby_names.csv file to baby_names.json.  There are many CSV-to-JSON conversion tools available; just search for "CSV to JSON converter".
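If you're curious what such a converter does under the hood, here is a minimal sketch in Scala. It is for illustration only, not a real tool: the `CsvToJson` object and `convert` method are names I made up, and it assumes clean input with no embedded commas, quotes, or type inference (every value stays a string, just as in the output below):

```scala
// Hypothetical sketch of a CSV-to-JSON converter: pair each header column
// with a row value, emit one JSON object per row, and wrap all rows in a
// single JSON array. Assumes clean input -- no escaping or type inference.
object CsvToJson {
  def convert(header: Seq[String], rows: Seq[Seq[String]]): String =
    rows.map { row =>
      header.zip(row)
        .map { case (k, v) => "\"" + k + "\": \"" + v + "\"" }
        .mkString("{", ", ", "}")
    }.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    val header = Seq("Year", "First Name", "County", "Sex", "Count")
    val rows = Seq(
      Seq("2013", "DAVID", "KINGS", "M", "272"),
      Seq("2013", "JAYDEN", "KINGS", "M", "268")
    )
    println(convert(header, rows))
  }
}
```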

I converted and reduced the baby_names.csv to the following:

[{
	"Year": "2013",
	"First Name": "DAVID",
	"County": "KINGS",
	"Sex": "M",
	"Count": "272"
}, {
	"Year": "2013",
	"First Name": "JAYDEN",
	"County": "KINGS",
	"Sex": "M",
	"Count": "268"
}, {
	"Year": "2013",
	"First Name": "JAYDEN",
	"County": "QUEENS",
	"Sex": "M",
	"Count": "219"
}, {
	"Year": "2013",
	"First Name": "MOSHE",
	"County": "KINGS",
	"Sex": "M",
	"Count": "219"
}, {
	"Year": "2013",
	"First Name": "ETHAN",
	"County": "QUEENS",
	"Sex": "M",
	"Count": "216"
}]

I saved this as a file called baby_names.json.
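As an aside, the line-delimited form that jsonFile handles natively would drop the enclosing array and put each record on its own line, like this; in this tutorial we keep the multi-line file on purpose to demonstrate the workaround:

```json
{"Year": "2013", "First Name": "DAVID", "County": "KINGS", "Sex": "M", "Count": "272"}
{"Year": "2013", "First Name": "JAYDEN", "County": "KINGS", "Sex": "M", "Count": "268"}
{"Year": "2013", "First Name": "JAYDEN", "County": "QUEENS", "Sex": "M", "Count": "219"}
{"Year": "2013", "First Name": "MOSHE", "County": "KINGS", "Sex": "M", "Count": "219"}
{"Year": "2013", "First Name": "ETHAN", "County": "QUEENS", "Sex": "M", "Count": "216"}
```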

Steps

1. Start the spark-shell from the same directory containing the baby_names.json file

2. Load the JSON using the SparkContext wholeTextFiles method, which produces a PairRDD.  Use map to create a new RDD from the value portion of each pair.

scala> val jsonRDD = sc.wholeTextFiles("baby_names.json").map(x => x._2)
jsonRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at map at <console>:21

3. Read in this RDD as JSON and confirm the schema

scala> val namesJson = sqlContext.read.json(jsonRDD)
namesJson: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]

scala> namesJson.printSchema
root
 |-- Count: string (nullable = true)
 |-- County: string (nullable = true)
 |-- First Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Year: string (nullable = true)

scala>

 

Featured image credit https://flic.kr/p/9pJKgA

6 thoughts on "Spark SQL JSON Examples"

  1. This is a nice read. If you have some free time, can you guide me through the scenario below?
    How do I process a file when it has multiple JSON records in it? e.g.

    {
    name : "Test Name"
    age : "10"
    }
    {
    name : "Test Name2"
    age : "50"
    }

    Is there a way to process files like these? Or do I have to write a custom InputFormat, like in Hadoop, for this?

    1. sqlContext.read.json will read this type of data too. Just pass it to the json method and do your processing normally.

  2. It was helpful for me. How do I access array elements in a JSON schema? I would like to read the promotions elements. Please suggest.

    |-- promotions: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- promotionContent: struct (nullable = true)
    |    |    |    |-- ID: string (nullable = true)

    1. I had the same problem.

      In Spark you can use:
      scala> val selectView = sqlContext.sql("SELECT inline(promotions) FROM table_name")
      If you want to save the result in Hive, execute:
      scala> selectView.saveAsTable("database.final_table")
      I hope this helps.

  3. What if I have a JSON file which looks like this
    {
    "Name": "Hafeez",
    "Type": "Normal",
    "Lang": "French",
    "tags": [{
    "1": "Kid",
    "2": "Path",
    "3": "LCD"
    }, {
    "1": "Brother",
    "2": "Kole",
    "3": "screen",
    "4": 6
    }]
    }

    and I need the schema to look like this
    scala> namesJson.printSchema
    root
    |-- Name: string (nullable = true)
    |-- Type: string (nullable = true)
    |-- Lang: string (nullable = true)
    |-- 1: string (nullable = true)
    |-- 2: string (nullable = true)
    |-- 3: string (nullable = true)
    |-- 4: integer (nullable = true)
