Wondering how to perform stress tests against Apache Spark Thrift Server? This tutorial describes one way to do it.
What is Apache Spark Thrift Server?
Apache Spark Thrift Server is based on Apache HiveServer2 and was created to allow JDBC/ODBC clients to execute SQL queries on a Spark cluster. In my experience, these “clients” are typically business intelligence tools such as Tableau, and they are most often only a portion of the overall Spark architecture. In other words, the Spark cluster is primarily used for streaming and batch aggregation jobs, and any JDBC/ODBC client access to the cluster via Thrift Server is secondary at best.
For more information on Apache Spark Thrift Server and an example use case, see the previous Spark Thrift Server tutorial.
Apache Spark Thrift Server Load Testing Example Overview
- Confirm our environment (Spark, Cassandra, Thrift Server)
- Compile our Gatling-based load testing code
- Run a sample Spark Thrift load test
Set up and configure our environment: Spark, Cassandra, and Thrift Server
If you are at the point of load testing Apache Spark Thrift Server, I’m going to assume you are already familiar with setting up Spark and Cassandra, or some other backend such as Hive or Parquet. Therefore, I’m just going to run through the steps to start everything up in my local environment. Adjust the following to best match your environment.
Confirm Environment
1. Start Cassandra
For this tutorial, we’re going to use the killrweather sample keyspace and queries created in the previous Spark Thrift Server with Cassandra post. You need to go through that tutorial first. This post assumes you have already created and loaded the data, so all we need to do now is start Cassandra if it is not already running.
`$CASSANDRA_HOME/bin/cassandra`
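With Cassandra up, it may be worth a quick sanity check that the sample data is actually loaded. The keyspace and table names below are the ones from the killrweather sample used in the previous post; adjust them if yours differ.

`$CASSANDRA_HOME/bin/cqlsh -e "SELECT * FROM isd_weather_data.raw_weather_data LIMIT 5;"`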
2. Start your Spark master and at least one worker
If your Spark cluster is not already running, start it up:
`$SPARK_HOME/sbin/start-master.sh`
`$SPARK_HOME/sbin/start-slave.sh spark://<spark-master>:7077`
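Before continuing, you can confirm the worker registered with the master in the master web UI (default port 8080), where it should be listed with a state of ALIVE. A rough check from the shell, which just greps the UI HTML for that state:

`curl -s http://localhost:8080 | grep ALIVE`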
3. Start the Thrift Server and set configuration for Cassandra
`$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077`
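Once the Thrift Server is up, a quick smoke test over JDBC confirms it is accepting connections. Beeline ships with Spark, and assuming the default Thrift port of 10000:

`$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "show databases;"`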
Obtain Sample Apache Spark Thrift Server Load Tests
Clone the repo https://github.com/tmcgrath/gatling-sql
This repo contains a Gatling extension I wrote, which is what allows us to load test the Spark Thrift Server.
See the src/main/resources/application.conf file for default Spark Thrift connection settings and adjust as needed.
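I won’t reproduce the file here, but the settings to review are the JDBC connection details for your Thrift Server. As a purely hypothetical illustration of the shape (the key names below are invented for this example; use the actual keys you find in the repo’s application.conf):

```
# Hypothetical keys for illustration only -- see the repo's application.conf for the real ones
jdbc {
  url = "jdbc:hive2://localhost:10000"
  username = ""
  password = ""
}
```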
Running the sample simulation at src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala is covered in the next section.
Run Load Tests
You can run the src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala simulation either by invoking the Maven `test` task or by building a jar and using the included `launch.sh` script.
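For example, with Maven: `mvn test`. Under the hood, each Gatling virtual user opens a JDBC connection to the Thrift Server and executes a statement against it. If you want to sanity-check that path outside of Gatling entirely, here is a minimal standalone sketch; it assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on your classpath and the Thrift Server is on the default port 10000, and the query is illustrative only.

```scala
import java.sql.DriverManager

// Minimal JDBC smoke test against Spark Thrift Server.
// Assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath
// and the Thrift Server is listening on the default port, 10000.
object ThriftSmokeTest {
  def main(args: Array[String]): Unit = {
    // Explicit driver load for older hive-jdbc versions that lack service discovery
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
    try {
      val stmt = conn.createStatement()
      // Illustrative query only; substitute a query against your own tables
      val rs = stmt.executeQuery("SHOW DATABASES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}
```

This is essentially what each virtual user in the Gatling simulation does, repeated concurrently at whatever injection rate the simulation defines.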
Conclusion
Hopefully, this tutorial on load testing Apache Spark Thrift Server helps get you started.
If you have any questions or ideas for corrections, let me know in the comments below.
References
- Spark Thrift Server https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
- Gatling http://gatling.io/
- Spark Thrift Server with Gatling source https://github.com/tmcgrath/gatling-sql
- Spark Monitoring tutorials
Featured Image credit https://flic.kr/p/e5hWaC