Apache Spark Thrift Server Load Testing Example

Wondering how to perform stress tests with Apache Spark Thrift Server? This tutorial describes one way to do it.

What is Apache Spark Thrift Server?

Apache Spark Thrift Server is based on Apache HiveServer2 and allows JDBC/ODBC clients to execute SQL queries on a Spark cluster. In my experience, these “clients” are typically business intelligence tools such as Tableau, and they are most often only one part of the overall Spark architecture. In other words, the Spark cluster is primarily used for streaming and batch aggregation jobs, and any JDBC/ODBC client access to the cluster via Thrift Server is secondary at best.

For more information on Apache Spark Thrift Server and an example use case, see the previous Spark Thrift Server tutorial.

Apache Spark Thrift Server Load Testing Example Overview

How do we simulate anticipated load on our Apache Spark Thrift Server? In this post, we are going to use an open source tool called Gatling. Check out the References section at the bottom of this post for links to Gatling.

At a high level, this Spark Thrift with Gatling tutorial runs through the following steps:

Confirm our environment (Spark, Cassandra, Thrift Server)
Compile our Gatling based load testing code
Run a sample Spark Thrift load test

Set up and configure our environment of Spark, Cassandra, and Thrift Server

If you are at the point of load testing Apache Spark Thrift Server, I’m going to assume you are already familiar with the setup of Spark and Cassandra, or some other backend such as Hive or Parquet. Therefore, I’m just going to run through the steps to start everything up in my local environment. Adjust the following to best match your environment.

Confirm Environment

1. Start Cassandra

For this tutorial, we’re going to use the killrweather video sample keyspace and queries created in the previous Spark Thrift Server with Cassandra post. You need to go through that tutorial first. This post assumes you have already created and loaded the data, so all we need to do now is start Cassandra if it is not already running.

`$CASSANDRA_HOME/bin/cassandra`
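If you are not sure whether Cassandra is already up, a small helper like the one below can check before starting it. This is only a sketch under a few assumptions: it needs bash (for the `/dev/tcp` pseudo-device), it assumes Cassandra listens on its default CQL port 9042, and it uses the same 127.0.0.1 host that appears later in this post. The function names `port_open` and `start_cassandra_if_needed` are my own and exist only for illustration.

```shell
# Return 0 if something accepts TCP connections on host:port, non-zero otherwise.
# Relies on bash's /dev/tcp pseudo-device, so run it with bash rather than plain sh.
port_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Start Cassandra only when nothing is listening on the CQL port yet.
# 9042 is Cassandra's default native transport port (an assumption about your setup).
start_cassandra_if_needed() {
  if port_open 127.0.0.1 9042; then
    echo "Cassandra already listening on 9042"
  else
    "$CASSANDRA_HOME/bin/cassandra"
  fi
}
```

Calling `start_cassandra_if_needed` replaces the manual check. The same `port_open` helper can be pointed at the Spark master port (7077 in this post) once the cluster is started.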

2. Start your Spark Master and at least one Worker

If your Spark cluster is not already running, then start it up:

`$SPARK_HOME/sbin/start-master.sh`

`$SPARK_HOME/sbin/start-slave.sh spark://<spark-master>:7077`

3. Start the Thrift Server and set configuration for Cassandra

`$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077`
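Before pointing a load test at the Thrift Server, it can help to confirm that it answers a single query. The sketch below uses `beeline`, which ships in `$SPARK_HOME/bin`, with its `-u` (JDBC URL) and `-e` (query) options. It assumes the Thrift Server is on localhost and on the default port 10000; adjust both if your setup differs. The function names `thrift_jdbc_url` and `thrift_smoke_test` are hypothetical helpers, not part of Spark.

```shell
# Build a HiveServer2-style JDBC URL; Spark Thrift Server uses the same URL scheme.
# Defaults (localhost, port 10000) are assumptions about a local setup.
thrift_jdbc_url() {
  local host="${1:-localhost}"
  local port="${2:-10000}"
  echo "jdbc:hive2://${host}:${port}"
}

# Send one trivial query through beeline to confirm the server responds.
thrift_smoke_test() {
  "$SPARK_HOME/bin/beeline" -u "$(thrift_jdbc_url "$@")" -e "SHOW TABLES;"
}
```

Running `thrift_smoke_test` with no arguments targets localhost:10000. Pass a host and port, for example `thrift_smoke_test 10.0.0.5 10015`, to target a different server.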

Obtain Sample Apache Spark Thrift Server Load Tests

Clone the repo https://github.com/tmcgrath/gatling-sql

This repo contains a Gatling extension I wrote. This extension is what will allow us to load test the Spark Thrift Server.

See the src/main/resources/application.conf file for default Spark Thrift connection settings and adjust as needed.

The example simulation lives at src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala. Running it is covered in the next section.

Run Load Tests

You can run the src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala simulation by invoking the Maven `test` task, or by building a jar and using the included `launch.sh` script.
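For the Maven route, the steps could be wrapped up roughly as follows. Treat this as a sketch: the local directory name `gatling-sql`, the assumption that `pom.xml` sits at the repo root, and the function names `fetch_repo` and `run_load_test` are all mine. I have left `launch.sh` out because its arguments are best read from the script itself.

```shell
REPO_URL="https://github.com/tmcgrath/gatling-sql"
REPO_DIR="${REPO_DIR:-gatling-sql}"   # hypothetical local checkout directory

# Clone the repo on first use; later runs reuse the existing checkout.
fetch_repo() {
  [ -d "$REPO_DIR/.git" ] || git clone "$REPO_URL" "$REPO_DIR"
}

# Run the simulations via Maven's test phase from the checkout root.
# Assumes pom.xml is at the top level of the repo.
run_load_test() {
  if [ ! -f "$REPO_DIR/pom.xml" ]; then
    echo "no pom.xml in $REPO_DIR; clone the repo first" >&2
    return 1
  fi
  (cd "$REPO_DIR" && mvn test)
}
```

A typical run would be `fetch_repo && run_load_test`, with Cassandra, Spark, and the Thrift Server already started as described above.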

Conclusion

Hopefully, this tutorial on load testing Apache Spark Thrift Server helps get you started.

If you have any questions or ideas for corrections, let me know in the comments below.

References

Spark Thrift Server https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
Gatling http://gatling.io/
Spark Thrift Server with Gatling source https://github.com/tmcgrath/gatling-sql


Featured Image credit https://flic.kr/p/e5hWaC