In this Spark Read JDBC tutorial, we will cover using Spark SQL with a MySQL database.
Spark’s JDBC read methods allow us to read data and create DataFrames from a relational database that supports JDBC connectivity.
This is useful for a variety of reasons, including leveraging Spark’s distributed computing capabilities to process data stored in a traditional database.
Using read.format("jdbc"), as shown in the Spark read JDBC examples below, we can specify a JDBC connection URL, username, and password, as well as any other JDBC properties required to connect to the database.
Once the data is loaded into a DataFrame, we can then use Spark’s SQL and DataFrame APIs to perform various transformations and analyses on the data. We will go through examples of this below.
Overall, Spark’s JDBC read methods provide a convenient way to access and process data stored in traditional databases.
** Updated May 2023. This tutorial was originally published in 2015 **
Note: These Spark read examples take a Scala-centric approach. If you are more interested in Python, check out the PySpark JDBC tutorial.
Table of Contents
- Overview
- Spark Read JDBC Examples Requirements
- MySQL Setup for Spark JDBC Example Overview
- Spark JDBC Overview
- Spark Read JDBC with MySQL Example Tutorial Part 1
- Spark JDBC Write and Read Example Part 2
- Conclusion Spark Read JDBC with MySQL
- Spark JDBC Examples Resources
Overview
In this Spark read JDBC tutorial, we’ll explore code examples of using Spark SQL with MySQL by connecting over JDBC.
We’re going to use MySQL with Spark in this tutorial, but you can apply the concepts presented here to any relational database that has a JDBC driver.
spark.read.jdbc() is a method in Spark’s DataFrameReader API that reads data from a JDBC data source and creates a DataFrame. The read.jdbc() method takes a JDBC connection URL, a table (or query), and a set of optional parameters specifying how to connect to the database. We’ll go through specific examples below.
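As a minimal sketch of what that looks like, assuming the database, table, and credentials set up later in this tutorial and the spark session that spark-shell provides (the URL, user, and password are placeholders for your own environment):

import java.util.Properties

// JDBC connection properties -- user, password, and driver are placeholders for your environment
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "example")
connectionProperties.put("driver", "com.mysql.cj.jdbc.Driver")

// Read the baby_names table over JDBC into a DataFrame
val babyNames = spark.read.jdbc(
  "jdbc:mysql://localhost/supergloo",  // JDBC URL; host and database are placeholders
  "baby_names",                        // table to read
  connectionProperties
)

This is equivalent to the read.format("jdbc") style used in the examples below; which one you use is largely a matter of taste.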
By the way, if you are not familiar with Spark SQL, there are a few Spark SQL tutorials on this site.
Spark Read JDBC Examples Requirements
1. Docker and Docker Compose
I will be using Docker Compose in these examples in case you’d like to follow along. There is a screencast with some really catchy intro music and effects below. I use the mysql.yml compose file, which is available from the GitHub repo in the Resources section below.
docker-compose -f mysql.yml up -d
Even if you don’t like catchy music, the screencast below shows setting up the MySQL environment used in these examples.
2. MySQL JDBC driver
A download is available at https://dev.mysql.com/downloads/connector/j/
3. Apache Spark
In the first example, the JDBC driver downloaded in step 2 above needs to be on the classpath. I’ll explain more below.
4. Sequel Ace [Optional]
Sequel Ace is used to import the CSV file used in the Spark Read JDBC examples. There is more than one way to import a CSV file into MySQL, as you probably already know, so I’m going to leave this step as optional because you may choose to import it a different way.
We’re going to use the baby_names.csv file as source data for the new table. This file is available from the GitHub repo mentioned below.
MySQL Setup for Spark JDBC Example Overview
I’ll list all the steps to repeat the demo, but let’s cover the high-level overview first.
In this first Spark read example, I’m going to run MySQL from a Docker container using docker-compose. The compose file used is available from the supergloo GitHub repo (link in the Resources section below).
Next, this Spark JDBC with MySQL tutorial assumes a MySQL database named “supergloo” with a table called “baby_names” has been created. I’ll walk through how I created it in case it’s helpful to you, but again, there is more than one way to do it.
Last, the “baby_names” table has been populated with the baby_names.csv data used in previous Spark tutorials. This CSV file is also available from the GitHub repo in the Resources section below.
Here’s a screencast on YouTube of how I set up my environment:
In case you’re wondering, here is the SQL to create the baby_names table:
DROP TABLE IF EXISTS `baby_names`;
CREATE TABLE `baby_names` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`year` int(11) DEFAULT NULL,
`first_name` varchar(100) DEFAULT NULL,
`county` varchar(100) DEFAULT NULL,
`sex` varchar(5) DEFAULT NULL,
`count` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
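If you’d rather skip Sequel Ace, one alternative is a plain LOAD DATA statement run from a MySQL client. This is only a sketch: the file path is a placeholder, the column order is an assumption about baby_names.csv, and local_infile must be enabled on both the server and the client.

LOAD DATA LOCAL INFILE '/path/to/baby_names.csv'
INTO TABLE baby_names
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(`year`, `first_name`, `county`, `sex`, `count`);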
If you have any questions about the environment setup, leave comments on this post.
Spark JDBC Overview
When reading from or writing to a JDBC database from Apache Spark, the JDBC driver needs to be on the classpath. This JDBC jar file may be added to the classpath in a few different ways. I will show one way in the examples below, but you should know there is more than one way to do it.
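For example, instead of copying the jar into Spark’s jars/ directory as I do in Part 1 below, the driver can be supplied when launching spark-shell. A sketch, assuming the connector version used in this tutorial (adjust paths and versions for your environment):

# Option 1: point spark-shell at a local copy of the MySQL connector jar
bin/spark-shell --jars /path/to/mysql-connector-j-8.0.33.jar

# Option 2: let Spark resolve the connector from Maven Central
bin/spark-shell --packages com.mysql:mysql-connector-j:8.0.33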
Spark Read JDBC with MySQL Example Tutorial Part 1
In this first example, I will show using spark-shell to connect to the previously created MySQL database. In this code example, we’ll connect, perform a JDBC-based read to create a DataFrame, register a temp table, and then query it with SQL.
Assuming you downloaded the requirements mentioned above and have your MySQL database container running with the “supergloo” database and “baby_names” table created and loaded, here are the steps to read from JDBC in Spark and query with SQL.
1. Add JDBC Driver to Spark Classpath
In this first example, I’m going to run the code examples from spark-shell. My SPARK_HOME environment variable is ~/dev/spark-3.4.0-bin-hadoop3
I downloaded and extracted the MySQL JDBC driver to ~/dev/mysql-connector-j-8.0.33
From my SPARK_HOME directory:
cp ../mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar jars/
You will have to update your path and JAR file name accordingly. This is just how I ran it from my environment.
What did we just do? By copying the driver jar file into the jars/ directory, the driver used to read from the database will be available on the spark-shell classpath. We’re ready to move to the next step.
2. Spark Read JDBC
Once the Spark shell is running, let’s establish a connection to the MySQL database and read the baby_names table:
$ bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/05 08:42:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.1.16:4040
Spark context available as 'sc' (master = local[*], app id = local-1683294155525).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.4.0
/_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 16.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val bn = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/supergloo").option("driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "baby_names").option("user", "root").option("password", "example").load()
bn: org.apache.spark.sql.DataFrame = [Year: int, First Name: string ... 3 more fields]
Change the MySQL URL and user/password values in the above code as appropriate for your environment.
As you can hopefully see, we are creating a new DataFrame using spark.read.format with multiple option configuration settings. We can see from the output above that the Spark read JDBC operation was successful, but let’s take a closer look in the next step.
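Before moving on, one readability note: the same read can be expressed with an options map rather than chained option calls, which is handy once the code moves out of the REPL. A sketch with the same placeholder values as above:

// Collect the JDBC settings in one place; values are placeholders for your environment
val jdbcReadOptions = Map(
  "url"      -> "jdbc:mysql://localhost/supergloo",
  "driver"   -> "com.mysql.cj.jdbc.Driver",
  "dbtable"  -> "baby_names",
  "user"     -> "root",
  "password" -> "example"
)

// Equivalent to chaining .option(...) calls one by one
val bn = spark.read.format("jdbc").options(jdbcReadOptions).load()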
3. Confirm Spark Read JDBC DataFrame
Let’s confirm the DataFrame by showing the contents of the table:
scala> bn.show
+----+----------+-----------+---+-----+
|Year|First Name| County|Sex|Count|
+----+----------+-----------+---+-----+
|2007| ZOEY| KINGS| F| 11|
|2007| ZOEY| SUFFOLK| F| 6|
|2007| ZOEY| MONROE| F| 6|
|2007| ZOEY| ERIE| F| 9|
|2007| ZOE| ULSTER| F| 5|
|2007| ZOE|WESTCHESTER| F| 24|
|2007| ZOE| BRONX| F| 13|
|2007| ZOE| NEW YORK| F| 55|
|2007| ZOE| NASSAU| F| 15|
|2007| ZOE| ERIE| F| 6|
|2007| ZOE| SUFFOLK| F| 14|
|2007| ZOE| KINGS| F| 34|
|2007| ZOE| MONROE| F| 9|
|2007| ZOE| QUEENS| F| 26|
|2007| ZOE| ALBANY| F| 5|
|2007| ZISSY| ROCKLAND| F| 5|
|2007| ZISSY| KINGS| F| 27|
|2007| ZION| KINGS| M| 15|
|2007| ZION| BRONX| M| 14|
|2007| ZEV| ROCKLAND| M| 6|
+----+----------+-----------+---+-----+
only showing top 20 rows
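We can also confirm the schema Spark derived from the table’s JDBC metadata (output omitted here; the column names and types should mirror the MySQL table):

scala> bn.printSchema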
4. Register Temp Table
Register the data as a temp table for future SQL queries
scala> bn.createOrReplaceTempView("baby_names")
5. Run SQL Query with JDBC-based Data
At this point, we are ready to begin querying our JDBC-based data source. A simple SQL query such as the following can be run:
scala> spark.sql("select * from baby_names").show
+----+----------+-----------+---+-----+
|Year|First Name| County|Sex|Count|
+----+----------+-----------+---+-----+
|2007| ZOEY| KINGS| F| 11|
|2007| ZOEY| SUFFOLK| F| 6|
|2007| ZOEY| MONROE| F| 6|
|2007| ZOEY| ERIE| F| 9|
|2007| ZOE| ULSTER| F| 5|
|2007| ZOE|WESTCHESTER| F| 24|
|2007| ZOE| BRONX| F| 13|
|2007| ZOE| NEW YORK| F| 55|
|2007| ZOE| NASSAU| F| 15|
|2007| ZOE| ERIE| F| 6|
|2007| ZOE| SUFFOLK| F| 14|
|2007| ZOE| KINGS| F| 34|
|2007| ZOE| MONROE| F| 9|
|2007| ZOE| QUEENS| F| 26|
|2007| ZOE| ALBANY| F| 5|
|2007| ZISSY| ROCKLAND| F| 5|
|2007| ZISSY| KINGS| F| 27|
|2007| ZION| KINGS| M| 15|
|2007| ZION| BRONX| M| 14|
|2007| ZEV| ROCKLAND| M| 6|
+----+----------+-----------+---+-----+
only showing top 20 rows
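Any SQL that Spark supports can now be run against this temp view. For example, a quick aggregation over the same data (a sketch; output omitted, and note the backticks around the column name containing a space):

scala> spark.sql("SELECT `First Name`, SUM(`Count`) AS total FROM baby_names GROUP BY `First Name` ORDER BY total DESC LIMIT 10").show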
Spark JDBC Write and Read Example Part 2
Now that we have a working example, let’s consider an example of a Spark JDBC read and a Spark JDBC write from a deployed application.
Rather than using the spark-shell, let’s use a Scala object called SparkSQLJDBCApp. The source code, build environment, and pretty much everything else you would need is in the GitHub repo linked below.
In this example, the code is similar to the above, but we now need to obtain a SparkSession manually and set spark.master accordingly.
This example also becomes more complete because it writes back to a JDBC destination, as seen in the following code:
pop.write.format("jdbc").options(jdbcWriteOptions).saveAsTable("popular_baby_names")
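The repo has the full working code, but here is a rough sketch of the shape of such an application. This is not the repo source verbatim: the connection values, database, and table names are placeholders, the aggregation is just an illustration, and I use the plain option("dbtable", ...).save() write style here rather than the repo’s saveAsTable call.

import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkSQLJDBCApp extends App {

  // Build the session ourselves since we are not in spark-shell;
  // in a real deployment, master typically comes from spark-submit instead of being hard-coded
  val spark = SparkSession.builder
    .appName("SparkSQLJDBCApp")
    .master("local[*]")
    .getOrCreate()

  // Placeholder connection settings -- adjust for your environment
  val jdbcOptions = Map(
    "url"      -> "jdbc:mysql://localhost/supergloo",
    "driver"   -> "com.mysql.cj.jdbc.Driver",
    "user"     -> "root",
    "password" -> "example"
  )

  // Read the source table over JDBC
  val babyNames = spark.read.format("jdbc")
    .options(jdbcOptions)
    .option("dbtable", "baby_names")
    .load()

  // Derive something to write back; this aggregation is only an illustration
  val popular = babyNames.groupBy("First Name").count()

  // Write the result to a new MySQL table over JDBC
  popular.write.format("jdbc")
    .options(jdbcOptions)
    .option("dbtable", "popular_baby_names")
    .mode(SaveMode.Overwrite)
    .save()

  spark.stop()
}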
Here’s a video of me running it in IntelliJ and showing the results in MySQL.
As you can see in the video, I run the example from IntelliJ, but the source code is all set for you to assemble into a JAR file and deploy to a Spark cluster.
You can find more on using spark-submit here.
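As a sketch, deploying the assembled JAR might look something like the following; the class name, master URL, and JAR paths are placeholders you’d replace with the values from the repo and your build:

# Supply the MySQL connector on the classpath at submit time as well
bin/spark-submit \
  --class SparkSQLJDBCApp \
  --master "local[*]" \
  --jars /path/to/mysql-connector-j-8.0.33.jar \
  /path/to/spark-jdbc-mysql-assembly.jar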
Conclusion Spark Read JDBC with MySQL
I created these Spark read JDBC and Spark write JDBC examples to get you up and running quickly with Spark SQL and any JDBC-compliant database. It’s also a reference for me, because I forget many things.
What other examples would you like to see with Spark SQL and Spark JDBC? Leave ideas or questions in comments below.
Spark JDBC Examples Resources
- Source Repo – https://github.com/supergloo/spark-scala-examples/tree/main/spark-jdbc-mysql
- Apache Spark documentation using JDBC Data source https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Before you go, for more Spark tutorials, check out Spark SQL with Scala.
Comments
It’s showing me the error “value not found: sqlContext”.
What version of Spark? Do you see the sqlContext when you first start up the shell?
For me also, it is giving the same error!
I am using version 2.0.2.
In Spark 2, you need to use `SparkSession`. See https://databricks.com/blog/2016/08/15/how-to-use-sparksession-in-apache-spark-2-0.html for more info. Hope this helps.
How could we store the queried data to a MySQL DB?
Maybe I’m not understanding your question entirely, but you could write results back to MySQL outside of Spark SQL DataFrames.
Can you please show how to write data (e.g. an INSERT INTO statement) into MySQL?
There is a problem when I try to do this on a huge table.
Although I use the right statistics for the number of partitions and the lower and upper bounds, this connector generates one `select * from table` statement instead of splitting it like it should. Any idea?
I am trying to make my password secure using the --password-file option but didn’t have any success. Can I do the same in Spark for secure access to the database?
Supporting the comment above. Could you please show how to write data? Thanks in advance.
Hi,
I’m trying to execute a Spark program with MySQL. It runs fine within the Eclipse IDE, but I am not able to submit the jar file. It is saying java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.load()Lorg/apache/spark/sql/DataFrame;
I even tried changing pom.xml with different versions, but to no avail.
Could you help me out in resolving this?
I am getting the error “java.lang.ClassNotFoundException: com.mysql.jdbc.Driver”
although I have downloaded the required jar and I am running spark-shell as below.
./bin/spark-shell --jars mysql-connector-java-5.1.39-bin.jar
Can you please help?
Is `mysql-connector-java-5.1.39-bin.jar` in the same directory as where you are running `./bin/spark-shell`?
I want Kafka to fetch data from an RDBMS, e.g. MySQL, and store it in Hive/HBase.
Brother, you rock!!! Truly. You saved me time and you made it look so simple to implement. 🙂
Just one minor update is required in the article:
spark-shell --jars → the --jars flag has a double dash; in your article, it looks like a single dash.
But thanks again. Keep posting 🙂
scala> val dataframe_mysql = sqlContext.read.format(“jdbc”).option(“url”, “jdbc:mysql://localhost/sparksql”).option(“driver”, “com.mysql.jdbc.Driver”).option(“dbtable”, “baby_names”).option(“user”, “root”).option(“password”, “root”).load()
Executing the above statement throws an error => :1: error: unclosed string literal
It seems to me your double quotes are not standard ones. Check the image: https://uploads.disquscdn.com/images/3f7b65174c9a8227a6e6fc91bada50effc5945afb0d9c09b635071dbbc408abb.png
🙂 Got it, thank you.
Hi Todd, do you have an example for the reverse scenario, from Spark to a MySQL database?
Error:
Caused by: java.net.ConnectException: Call From host to localhost:9000 failed on connection exception:
I do not understand why it is pointing to an HDFS location here?