In this tutorial, we will cover using Spark SQL with a mySQL database.
Let’s show examples of using Spark SQL mySQL. We’re going to use mySQL with Spark in this tutorial, but you can apply the concepts presented here to any relational database which has a JDBC driver.
By the way, If you are not familiar with Spark SQL, there are a few Spark SQL tutorials on this site.
1. MySQL instance
2. MySQL JDBC driver (download available https://dev.mysql.com/downloads/connector/j/)
3. Previously used baby_names.csv file as source data.
4. Apache Spark
The Spark SQL with MySQL JDBC example assumes a mysql db named “sparksql” with table called “baby_names”. The “baby_names” table has been populated with the baby_names.csv data used in previous Spark tutorials.
Here’s a screencast on YouTube of how I set up my environment:
The SQL to create the baby_names table:
DROP TABLE IF EXISTS `baby_names`; CREATE TABLE `baby_names` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `year` int(11) DEFAULT NULL, `first_name` varchar(100) DEFAULT NULL, `county` varchar(100) DEFAULT NULL, `sex` varchar(5) DEFAULT NULL, `count` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
If you have any questions about the environment setup, leave comments on this post.
We need to pass in the mySQL JDBC driver jar when we start up the Spark Shell. (In a Spark application, any third party libs such as a JDBC driver would be included in package.)
From Spark shell we’re going to establish a connection to the mySQL db and then run some queries via Spark SQL. This is a getting started with Spark mySQL example. To build and deploy and Spark application with mySQL JDBC driver you may wish to check out the Spark cluster deploy with extra jars tutorial.
Spark SQL with MySQL (JDBC) Example Tutorial
1. Start the spark shell with --jars argument
$SPARK_HOME/bin/spark--shell --jars mysql-connector-java-5.1.26.jar
This example assumes the mySQL connector JDBC jar file is located in the same directory as where you are calling spark-shell. If it is not, you can specify the path location such as:
$SPARK_HOME/bin/spark--shell --jars /home/example/jars/mysql-connector-java-5.1.26.jar
2. Once the Spark shell is running, let’s establish a connection to mySQL db and read the baby_names table:
Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.2 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79) Type in expressions to have them evaluated. Type :help for more information. ... SQL context available as sqlContext. scala> val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/sparksql").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "baby_names").option("user", "root").option("password", "root").load()
Change the mySQL url and user/password values in the above code appropriate to your environment.
3. Let’s confirm the dataframe by show the schema of the table
3. Register the data as a temp table for future SQL queries
4. We are now in a position to run some SQL such as
scala> dataframe_mysql.sqlContext.sql("select * from names").collect.foreach(println)
Conclusion Spark SQL with MySQL (JDBC)
This example was designed to get you up and running with Spark SQL and mySQL or any JDBC compliant database quickly. What other examples would you like to see with Spark SQL and JDBC? Please leave ideas or questions in comments below.
And to keep up with changes in Spark SQL, especially DataFrame vs DataSet, check the Apache Spark SQL documentation from time-to-time.
Featured Image credit: https://flic.kr/p/f8KB7L