Let’s start Apache Spark Streaming by building up our confidence with small steps. Small steps create the forward momentum we need when learning new skills, and the quickest way to gain that confidence and momentum is executing code that runs without error. Right? I mean, right!? This is pure software psychology here. Dropping pearls of wisdom here folks, pearls I tell you, pearls.
In this post, we’re going to set up and run Apache Spark Streaming with Scala code. Then, we should be confident in taking the next step to Part 2 of learning Apache Spark Streaming.
Before we begin, though, I assume you already have a high-level understanding of Apache Spark Streaming; if not, check out the Spark Streaming tutorials or the Spark Streaming with Scala section of this site. Also, here’s a quick two-minute read on Spark Streaming from the Learning Apache Spark Summary book.
Spark comes with some great examples and convenient scripts for running Streaming code. Let’s make sure you can run these examples. In case it helps, I made a screencast of me running through these steps. Link to the screencast below.
Running the NetworkWordCount example out-of-the-box
- Open a shell or command prompt on Windows and go to your Spark root directory.
- Start Spark Master: sbin/start-master.sh **
- Start a Worker: sbin/start-slave.sh spark://todd-mcgraths-macbook-pro.local:7077
- Start netcat on port 9999: nc -lk 9999 (*** Windows users: https://nmap.org/ncat/ Let me know in page comments if this works well on Windows)
- Run network word count using handy run-example script: bin/run-example streaming.NetworkWordCount localhost 9999
** Windows users, please adjust accordingly; i.e. sbin/start-master.cmd instead of sbin/start-master.sh
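Once the example is running, every line you type into netcat comes back as word counts in the Spark console. To see why, here’s the same word-count logic the example applies to each batch of lines, sketched with plain Scala collections so you can try it in a Scala REPL without Spark (the sample line is made up):

```scala
// The per-batch word-count transformation behind NetworkWordCount,
// modeled on plain Scala collections instead of a DStream.
val batch = Seq("to be or not to be")  // a hypothetical line typed into nc

val counts = batch
  .flatMap(_.split(" "))               // split each line into words
  .groupBy(identity)                   // group identical words together
  .map { case (word, occurrences) => (word, occurrences.size) } // count each group

println(counts("to"))                  // prints 2
```

Spark’s version does the counting with `map(x => (x, 1)).reduceByKey(_ + _)` across the cluster, but the result per batch is the same idea.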
Here’s a screencast of me running these steps
Making and Running Our Own NetworkWordCount
Ok, that’s good. We’ve succeeded in running the Scala Spark Streaming NetworkWordCount example, but what about running our own Spark Streaming program in Scala? Let’s take another step towards that goal. In this step, we’re going to set up our own Scala/SBT project, then compile, package, and deploy a modified NetworkWordCount. Again, I made a screencast of the following steps with a link to the screencast below.
- Choose or create a new directory for a new Spark Streaming Scala project.
- Create the directory structure SBT expects: src/main/scala
- Create a Scala source file called NetworkWordCount.scala in the src/main/scala directory
- Copy-and-paste the NetworkWordCount.scala code from the Spark examples directory into the file you created in the previous step
- Remove or comment out the package declaration and the StreamingExamples references
- Change the appName to “MyNetworkWordCount”
- Create a build.sbt file (source code below)
- Run sbt compile as a smoke test
- Deploy: ~/Development/spark-1.5.1-bin-hadoop2.4/bin/spark-submit --class "NetworkWordCount" --master spark://todd-mcgraths-macbook-pro.local:7077 target/scala-2.11/streaming-example_2.11-1.0.jar localhost 9999
- Start netcat on port 9999: nc -lk 9999 and start typing
- Check things out in the Spark UI
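After the steps above, your NetworkWordCount.scala should look roughly like this. This is a sketch based on the Spark 1.5.1 example, with the package declaration and the StreamingExamples logging helper removed and the appName changed; it needs the Spark dependencies from build.sbt to compile:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // appName changed per the steps above
    val sparkConf = new SparkConf().setAppName("MyNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Receive lines from the netcat server started with `nc -lk 9999`
    val lines = ssc.socketTextStream(args(0), args(1).toInt,
      StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The hostname and port (localhost 9999) arrive as the trailing arguments on the spark-submit command line.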
name := "streaming-example"

version := "1.0"

scalaVersion := "2.11.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.1",
  "org.apache.spark" %% "spark-streaming" % "1.5.1"
)
If you watched the video, notice this has been corrected to “streaming-example” and not “steaming-example” 🙂
Spark Streaming With Scala Part 1 Conclusion
At this point, I hope you were successful in running both Spark Streaming examples in Scala. If so, you should be more confident when we continue to explore Spark Streaming in Part 2. If you have any questions, feel free to add comments below.
You may also find the following landing page helpful for more information on Spark, as well as Spark with Scala and Spark with Python.
Featured image credit https://flic.kr/p/bVJF32