Spark SQL CSV Examples in Scala


In this Spark SQL tutorial, we will use Spark SQL with a CSV input data source.  We will continue to use the baby names CSV source file as used in the previous What is Spark tutorial.  This tutorial presumes the reader is familiar with using SQL with relational databases and would like to know how to use Spark SQL in Spark.

Overview

Earlier versions of Spark SQL required a certain kind of Resilient Distributed Dataset (RDD) called SchemaRDD.  Starting with Spark 1.3, SchemaRDD was renamed to DataFrame.

DataFrames are composed of Row objects accompanied with a schema which describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database. A DataFrame may be created from a variety of input sources including CSV text files.
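As a rough sketch (plain Scala, no Spark required), the rows-plus-schema idea can be modeled with a case class whose fields play the role of typed columns. The names, types, and values below are made up for illustration, mirroring the baby names file used later:

```scala
// Plain-Scala sketch of the "rows + schema" idea (illustrative only; a real
// DataFrame stores untyped Row objects alongside a separate schema object).
case class BabyName(year: Int, firstName: String, county: String, sex: String, count: Int)

// Two sample rows; the values are invented for this sketch.
val rows = List(
  BabyName(2013, "DAVID", "KINGS", "M", 272),
  BabyName(2013, "EMMA", "KINGS", "F", 110)
)

// The field names and types stand in for the schema's column descriptions.
rows.foreach(r => println(s"${r.year} | ${r.firstName} | ${r.count}"))
```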

This intro to Spark SQL post will use a CSV file from a previous Spark tutorial.

Download the CSV version of baby names file here:

https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv?accessType=DOWNLOAD

For this and other Spark tutorials, the file has been named baby_names.csv.

Methodology

We’re going to use the spark-shell and the spark-csv package available from Spark Packages to make our lives easier.  It is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames”.  This library is compatible with Spark 1.3 and above.

Spark SQL CSV Example Tutorial Part 1

1. Depending on your version of Scala, start the spark shell with a packages command line argument.

At time of this writing, scala 2.10 version:

SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0

where SPARK_HOME is the root of your Spark installation directory; e.g. ~/Development/spark-1.4.1-bin-hadoop2.4 or c:/dev/spark-1.4.1-bin-hadoop2.4

At time of this writing, scala 2.11 version:

SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0

2. Using the sqlContext available from the shell, load the CSV with the read, format, option, and load functions

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val baby_names = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("baby_names.csv")
baby_names: org.apache.spark.sql.DataFrame = [Year: int, First Name: string, County: string, Sex: string, Count: int]


In the above code, we specify the com.databricks.spark.csv format from the package we passed to the shell in step 1.  Setting “header” to true signifies that the first row contains the column names.  “inferSchema” instructs Spark to attempt to infer the schema of the CSV, and finally the load function is passed the path and name of the CSV source file.  In this example, the baby_names.csv file is in the same directory from which the spark-shell script was launched.
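To give some intuition for what these options do, here is a hedged plain-Scala sketch (not the spark-csv implementation) of treating the first row as a header and inferring column types by probing whether every value in a column parses as an integer:

```scala
// Illustrative sketch of header=true and inferSchema=true, on toy CSV lines.
val csv = List(
  "Year,First Name,County,Sex,Count",   // header=true: first row holds column names
  "2013,DAVID,KINGS,M,272",
  "2013,EMMA,KINGS,F,110"
)

val header = csv.head.split(",").toList
val data   = csv.tail.map(_.split(",").toList)

// inferSchema: if every value in a column parses as Int, call the column int
def inferred(col: Int): String =
  if (data.forall(r => scala.util.Try(r(col).toInt).isSuccess)) "int" else "string"

val schema = header.zipWithIndex.map { case (name, i) => s"$name: ${inferred(i)}" }
println(schema.mkString(", "))
// Year: int, First Name: string, County: string, Sex: string, Count: int
```

The inferred column names and types match the DataFrame signature printed by the shell above.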

3. Register a temp table

scala> baby_names.registerTempTable("names")

4. We’re now ready to query using SQL, such as finding the distinct years in the CSV

scala> val distinctYears = sqlContext.sql("select distinct Year from names")
distinctYears: org.apache.spark.sql.DataFrame = [Year: int]

scala> distinctYears.collect.foreach(println)
[2007]                                                                          
[2008]
[2009]
[2010]
[2011]
[2012]
[2013]
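SQL’s DISTINCT maps directly onto .distinct on a plain Scala collection; a tiny sketch with made-up year values:

```scala
// "select distinct Year" is the same idea as .distinct on a collection.
val years = List(2007, 2008, 2007, 2009, 2008, 2010)
val distinctYears = years.distinct.sorted
println(distinctYears)   // List(2007, 2008, 2009, 2010)
```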

5. It might be handy to know the schema

scala> baby_names.printSchema
root
 |-- Year: integer (nullable = true)
 |-- First Name: string (nullable = true)
 |-- County: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Count: integer (nullable = true)

Spark SQL CSV Example Tutorial Part 2

But let’s try some more advanced SQL, such as determining which names appear most often in the data

scala> val popular_names = sqlContext.sql("select distinct(`First Name`), count(County) as cnt from names group by `First Name` order by cnt desc LIMIT 10")
popular_names: org.apache.spark.sql.DataFrame = [First Name: string, cnt: bigint]

scala> popular_names.collect.foreach(println)
[JACOB,284]
[EMMA,284]
[LOGAN,270]
[OLIVIA,268]
[ISABELLA,259]
[SOPHIA,249]
[NOAH,247]
[MASON,245]
[ETHAN,239]
[AVA,234]

But as we learned from the Spark transformation and action tutorials, this doesn’t necessarily give the most popular names.  We need to use the Count column value:

scala> val popular_names = sqlContext.sql("select distinct(`First Name`), sum(Count) as cnt from names group by `First Name` order by cnt desc LIMIT 10")
popular_names: org.apache.spark.sql.DataFrame = [First Name: string, cnt: bigint]

scala> popular_names.collect.foreach(println)
[MICHAEL,10391]
[ISABELLA,9106]
[MATTHEW,9002]
[JAYDEN,8948]
[JACOB,8770]
[JOSEPH,8715]
[SOPHIA,8689]
[DANIEL,8412]
[ANTHONY,8399]
[RYAN,8187]
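The distinction between counting rows and summing the Count column can be sketched with plain Scala collections on made-up (name, county, count) data:

```scala
// Toy data: one record per name/county, as in the baby names file.
val recs = List(
  ("EMMA",  "KINGS",  110),
  ("EMMA",  "QUEENS",  95),
  ("MASON", "KINGS",    5)
)

// count(County) counts rows per name -- how many county entries a name has
val rowCounts = recs.groupBy(_._1).map { case (n, rs) => n -> rs.size }

// sum(Count) adds up the Count column -- how many babies were given the name
val totals = recs.groupBy(_._1).map { case (n, rs) => n -> rs.map(_._3).sum }

// EMMA appears in only 2 rows but accounts for 205 babies; MASON for 5.
println(rowCounts)
println(totals)
```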

Spark SQL Further Reference

Check out the Spark SQL with Scala tutorials for more Spark SQL with Scala, including Spark SQL with JSON and Spark SQL with JDBC.  And for tutorials in Scala, see the Spark Tutorials in Scala page.

Featured image credit https://flic.kr/p/3CrmX

5 thoughts on “Spark SQL CSV Examples in Scala”

  1. Hello Admin,

    I have a question about “start the spark shell with a packages command line argument” (step 1 – Spark SQL CSV Example Tutorial Part 1). How can I set up these parameters in Ubuntu Linux? I typed that line in the terminal, but it doesn’t work! Is it necessary to set up some file such as ~/.bashrc?

    I’d appreciate any help! Thanks!

      1. Hello Admin,

        I typed that line with the colors, something like this: “[craynion …. line=sasfghr456778]$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0”

        But I saw that the page is different now. Did you change it? I think so, because the command line is now “SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0”! I tried it and it works!!! Thanks!!! : )

        I appreciate your support! I’m sorry about my English, I’m from Brazil!

        Best regards!

        1. Great! Glad to hear it’s working for you. I did change the page based on your comment, so thank you too!

          Your English is fine. No worries.

  2. Hey Admin,

    I am trying to count the number of rows of a particular column that satisfy a certain condition, but I am having trouble using the SQLContext to achieve it. Can you give me some suggestions?
