Spark Scala Overview
Spark provides developers and engineers with a Scala API. The Scala tutorials listed below cover the Spark Scala API within Spark Core, clustering, Spark SQL, streaming, machine learning (MLlib), and more.
You may access the tutorials in any order you choose.
The tutorials assume a general understanding of Spark and the Spark ecosystem, regardless of programming language. If you are new to Apache Spark, the recommended path is to start at the top and work your way down to the bottom.
If you are new to both Scala and Spark and want to become productive quickly, check out my Scala for Spark course.
New Spark tutorials are added here often, so make sure to check back, bookmark this page, or sign up for our notification list, which sends updates each month.
Apache Spark Essentials
- Spark Scala Overview
- Spark SQL with Scala
- Spark SQL with Scala Tutorials
- Spark Streaming with Scala
- Spark Machine Learning
- Spark Performance Monitoring and Debugging
- Spark with Scala Integration Tutorials
- Spark Operations
To become productive and confident with Spark, it is essential that you are comfortable with the Spark concepts of Resilient Distributed Datasets (RDDs), DataFrames, Datasets, transformations, and actions. In the following tutorials, the Spark fundamentals are covered from a Scala perspective.
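As a quick taste of these concepts, here is a minimal sketch of the lazy-transformation / eager-action distinction, assuming a Spark 2.x+ dependency and local mode (the object name `CoreConcepts` is illustrative, not from any tutorial):

```scala
import org.apache.spark.sql.SparkSession

object CoreConcepts {
  def run(): Array[Int] = {
    // "local[*]" runs Spark in-process on all local cores, so no cluster is needed.
    val spark = SparkSession.builder
      .appName("core-concepts")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD built from a local collection.
    val rdd = sc.parallelize(1 to 10)

    // Transformations (filter, map) are lazy: they only record lineage.
    val evensDoubled = rdd.filter(_ % 2 == 0).map(_ * 2)

    // An action (collect, count, reduce) triggers actual execution.
    val result = evensDoubled.collect()

    spark.stop()
    result   // Array(4, 8, 12, 16, 20)
  }
}
```

Nothing is computed until `collect` runs; the two transformations above merely describe the job.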
Spark Scala Tutorials
With these fundamental concepts and the Spark API examples above, you are in a better position to move on to any of the following sections on clustering, SQL, streaming, and/or machine learning (MLlib) organized below.
Spark applications run as independent sets of parallel processes distributed across numerous nodes; a group of nodes collaborating is commonly known as a “cluster”. Depending on your version of Spark, the distributed processes are coordinated by a SparkContext or SparkSession. The SparkContext can connect to several types of cluster managers, including Mesos, YARN, or Spark’s own internal cluster manager, called “Standalone”. Once connected to the cluster manager, Spark acquires executors on nodes within the cluster.
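In code, the cluster manager is selected by the master URL given when the session is built. The sketch below is a hedged example, assuming Spark 2.x+; the host in the Standalone URL is a placeholder, and `ClusterConnect` is an illustrative name:

```scala
import org.apache.spark.sql.SparkSession

object ClusterConnect {
  // The master URL selects the cluster manager:
  //   "local[*]"           – run in-process, no cluster (handy for testing)
  //   "spark://host:7077"  – Spark's own Standalone cluster manager
  //   "yarn", "mesos://…"  – external cluster managers
  def connect(master: String): String = {
    val spark = SparkSession.builder
      .appName("cluster-connect")
      .master(master)
      .getOrCreate()

    // Pre-2.0 code used SparkContext directly; it is still reachable
    // from the session, and reports which master it connected to.
    val actual = spark.sparkContext.master
    spark.stop()
    actual
  }
}
```

In a real deployment the master URL is usually left out of the code and passed to `spark-submit` with `--master` instead, so the same jar runs locally or on a cluster unchanged.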
The following Spark clustering tutorials will teach you about Spark cluster capabilities with Scala source code examples.
- Cluster Part 1 Run Standalone
- Cluster Part 2 Deploy a Scala program to the Cluster
- Spark Cluster Deploy Troubleshooting
- Accumulators and Broadcast variables
For more information on Spark Clusters, such as running and deploying on Amazon’s EC2, make sure to check the Integrations section at the bottom of this page.
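Since accumulators and broadcast variables appear in the cluster tutorials above, here is a minimal self-contained sketch of both, run in local mode (Spark 2.x+ assumed; `SharedVariables` and the lookup data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object SharedVariables {
  def run(): (Long, Int) = {
    val spark = SparkSession.builder
      .appName("shared-variables")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: a read-only value shipped once to each executor,
    // instead of once per task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))

    // Accumulator: tasks may only add to it; the driver reads the total.
    val misses = sc.longAccumulator("misses")

    val total = sc.parallelize(Seq("a", "b", "x", "c"))
      .map { key =>
        lookup.value.get(key) match {
          case Some(v) => v
          case None    => misses.add(1); 0   // count keys absent from the lookup
        }
      }
      .reduce(_ + _)   // action: triggers the job, so the accumulator is populated

    val result = (misses.value, total)
    spark.stop()
    result   // (1, 6): one miss ("x"), and 1 + 2 + 0 + 3
  }
}
```

Note that accumulator values are only reliable after an action has run; reading them mid-transformation can undercount.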
Spark SQL with Scala
Spark SQL is the Spark component for structured data processing. Spark SQL interfaces provide Spark with an insight into both the structure of the data as well as the processes being performed. There are multiple ways to interact with Spark SQL including SQL, the DataFrames API, and the Datasets API. Developers may choose between the various Spark API approaches.
Spark SQL queries may be written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from existing Hive installations. When running SQL from within a programming language such as Python or Scala, the results will be returned as a DataFrame. You can also interact with the SQL interface using JDBC/ODBC.
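To make the "results are returned as a DataFrame" point concrete, here is a small sketch in Scala, assuming Spark 2.x+ in local mode (the `people` table and names are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

object SqlQuery {
  def run(): Seq[String] = {
    val spark = SparkSession.builder
      .appName("sql-query")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // enables toDF / as[...]

    // Build a small DataFrame and register it so plain SQL can query it.
    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29))
      .toDF("name", "age")
    people.createOrReplaceTempView("people")

    // spark.sql(...) returns a DataFrame, which can be transformed further.
    val over30 = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
    val names = over30.as[String].collect().toSeq

    spark.stop()
    names   // Seq("Alice", "Bob")
  }
}
```

The same query could be expressed with the DataFrame API (`people.filter($"age" > 30)`); both compile down to the same optimized plan.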
A DataFrame is a distributed collection of data organized into named columns. DataFrames can be considered conceptually equivalent to a table in a relational database, but with richer optimizations. DataFrames can be created from sources such as CSVs, JSON, tables in Hive, external databases, or existing RDDs.
The Dataset is a newer interface, added as experimental in Spark 1.6. Datasets aim to combine the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine.
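A short sketch of the typed Dataset API, assuming Spark 2.x+ in local mode (`Person` and `DatasetSketch` are illustrative names):

```scala
import org.apache.spark.sql.SparkSession

// Case classes give Datasets a typed schema; define them at the top level
// so Spark's encoders can find them.
case class Person(name: String, age: Long)

object DatasetSketch {
  def run(): Long = {
    val spark = SparkSession.builder
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._   // provides the Person encoder via toDS

    // A Dataset is typed: p.age is checked by the compiler,
    // unlike a stringly-named DataFrame column.
    val ds = Seq(Person("Alice", 34), Person("Bob", 19)).toDS()
    val adults = ds.filter(p => p.age >= 21).count()

    spark.stop()
    adults   // 1
  }
}
```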
Spark SQL with Scala Tutorials
Readers may also be interested in pursuing tutorials such as Spark with Cassandra tutorials located in the Integration section below. Spark with Cassandra covers aspects of Spark SQL as well.
Spark Streaming with Scala
Spark Streaming is the Spark module that enables stream processing of live data streams. Data can be ingested from many sources, such as Kinesis, Kafka, Twitter, or TCP sockets (including WebSockets). The stream data may be processed with high-level functions such as `map`, `join`, or `reduce`, and the processed data can then be pushed out of the pipeline to filesystems, databases, and dashboards.
Spark’s MLlib algorithms may be used on data streams, as shown in the tutorials below.
Spark Streaming receives live input data streams and divides the data into batches at a configurable interval.
Spark Streaming provides a high-level abstraction called discretized stream or “DStream” for short. DStreams can be created either from input data streams or by applying operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
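A self-contained DStream sketch, assuming Spark 2.x+ with the spark-streaming module on the classpath. A real job would read from a socket or Kafka; here `queueStream` stands in for a live source so the example runs locally, and the input lines are invented:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object DStreamSketch {
  def run(): Map[String, Long] = {
    // Streaming needs at least 2 local threads: one receiver, one processor.
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")

    // Batch interval of 1 second: each batch of the DStream becomes one RDD.
    val ssc = new StreamingContext(conf, Seconds(1))

    // queueStream substitutes for socketTextStream / Kafka in this sketch.
    val queue = mutable.Queue(ssc.sparkContext.parallelize(Seq("a b", "a c")))
    val lines = ssc.queueStream(queue)

    // Classic word count over the stream, one result per micro-batch.
    val counts = mutable.Map.empty[String, Long]
    lines.flatMap(_.split(" "))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .foreachRDD { rdd => rdd.collect().foreach { case (w, n) => counts(w) = n } }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)   // let a few batch intervals pass
    ssc.stop(stopSparkContext = true)
    counts.toMap   // Map("a" -> 2, "b" -> 1, "c" -> 1)
  }
}
```

Each operation on the DStream (`flatMap`, `reduceByKey`) is applied to the underlying sequence of RDDs, one micro-batch at a time.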
Spark Streaming with Scala Tutorials
- Spark Streaming Overview
- Spark Streaming Example Streaming from Slack
- Spark Streaming with Kafka Tutorial
- Spark Structured Streaming with Kafka including JSON, CSV, Avro, and Confluent Schema Registry
- Spark Streaming with Kinesis Example
- Spark Streaming Testing
Spark Machine Learning
MLlib is Spark’s machine learning (ML) library component. MLlib’s goal is to make machine learning easier and more widely available. It consists of popular learning algorithms and utilities for tasks such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
Spark’s MLlib is divided into two packages:
- `spark.mllib`, which contains the original API built on top of RDDs
- `spark.ml`, built on top of DataFrames and used for constructing ML pipelines
`spark.ml` is the recommended approach because the DataFrame API is more versatile and flexible.
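A minimal `spark.ml` pipeline sketch, assuming Spark 2.x+ with the spark-mllib module on the classpath. The tiny training set is made up purely for illustration, and `PipelineSketch` is not a real tutorial name:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def run(): Seq[Double] = {
    val spark = SparkSession.builder
      .appName("pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Invented training set: label 1.0 roughly means "mentions spark".
    val training = Seq(
      ("spark rdd actions",   1.0),
      ("hadoop mapreduce",    0.0),
      ("spark sql dataframe", 1.0),
      ("relational database", 0.0)
    ).toDF("text", "label")

    // A Pipeline chains feature transformers and an estimator into one unit,
    // so fit() runs tokenization, hashing, and training in order.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)
    val model = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
      .fit(training)

    // The fitted PipelineModel applies the same stages to new data.
    val predictions = model
      .transform(Seq("spark streaming", "mysql table").toDF("text"))
      .select("prediction").as[Double].collect().toSeq

    spark.stop()
    predictions
  }
}
```

Because the whole chain lives in one `Pipeline`, the identical preprocessing is guaranteed at training and prediction time, which is the main argument for `spark.ml` over hand-wired `spark.mllib` code.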
Spark MLlib with Scala Tutorials
Spark Performance Monitoring and Debugging
- Spark Performance Monitoring with Metrics, Graphite and Grafana
- Spark Performance Monitoring Tools – A List of Options
- Spark Tutorial – Performance Monitoring with History Server
- Scala Spark Debugging in IntelliJ
Spark with Scala Integration Tutorials
The following Scala Spark tutorials build upon the previously covered topics into more specific use cases:
- Spark Amazon S3 Tutorial
- Spark Deploy to an EC2 Cluster Tutorial
- Spark Cassandra from Scala Tutorial
- Spark Scala in IntelliJ
- Apache Spark Thrift Server with Cassandra Tutorial
- Apache Spark Thrift Server Load Testing Example
The following Scala Spark tutorials are related to operational concepts:
Featured Image adapted from https://flic.kr/p/7zAZx7