What is stream processing? What is a streaming architecture? Why should you consider it? How do you implement it?
As you’ve probably noticed, the topic of stream processing is becoming popular. If you look around, you’ll see businesses are being formed around the associated open source streaming architecture technologies. You’ll notice traditional, mainstream ETL (extract, transform, load) is being reevaluated and transitioned to new ones such as ELT (extract, load, transform). Distributed computing, storage and choosing two of three from the CAP theorem are the new norms.
When considering stream processing from a historical perspective, is it innovation vs invention? In my opinion, it’s innovation. It’s a twist on familiar constructs.
The stream processing paradigm is similar to event stream processing. And when we consider streaming architecture concepts such as joining streams, it seems familiar to the proposed benefits of using views in relational database management systems. But, let’s not go too deep here. We’ll dissect similarities between old and new in more specific topic explorations listed below.
One key takeaway of your streaming architecture is determining if you will or will not implement a specialized log. This log will be a distributed transaction, ordered, append-only log for all your streaming architectures. The log is our lynchpin for building distributed, streaming systems and includes implementations in Apache Kafka, Apache Pulsar, AWS Kinesis, and others. But again, the log is nothing new. Many RDBMS use change logs (or WAL “write ahead logs”) to improve performance, provide point-in-time-recovery (PITR) hooks for disaster recovery and finally, the foundation for distributed replication. In most cases, I believe you’ll benefit from choosing to utilize a distributed log, but there are cases where you may wish to stream directly and checkpoint along the way.
Why Streaming Architecture?
First, let’s not overcomplicate the reasons why we are moving to a streaming architecture. It’s all about speed. Right? It’s part of the constant quest for bigger, faster, stronger. Kaizen if you please. That’s software. By implementing streaming architectures, we can be more reactive in analytics, monitoring, near real-time alerting and machine learning. Just to name a few. If you’re not getting faster and quicker, you’ll be slower than the others that are getting better. There’s just not much to debate here.
Stream Processing Patterns AND Requirements
Streaming patterns and constructs are starting to emerge. The following list will surely evolve over time, but for now, let’s start here
- Raw, unfiltered, ELT – Example: loading up a raw zone in a data lake
- Refined, harmonized, filter, streaming ETL – streaming from raw to filtered; transformations; anonymization, protected
- Joins – combining data from multiple sources and attempt to infer an event
- CDC – change data capture of transactional databases to analytics or detection analysis
- Specialized – for example, stream data to search implementations such as Elastic or Solr
- Lambda or Kappa architectures
- Time Considerations – must consider event vs processing time, an example could be calculating the average temperature every 5 minutes or average stock price over the last 10 minutes
- Windows and Aggregations – in concert the meaning of time, perform calculations such as sums, averages, max/min. An example could be calculating the average temperature every 5 minutes or average stock price over the last 10 minutes
- Machine Learning / AI – streaming data pipelines to training data repositories as well as running through machine learning model(s)
- Monitoring and alerting
Of course, there are variances and the possibility streaming data pipelines will cover many of these patterns at once.
Implementing Streaming Architectures
As previously suggested, the key component in streaming architectures may be a distributed log. Some examples of distributed logs include Apache Kafka, Apache Pulsar, AWS Kinesis, Google Cloud Dataflow, etc.
Next, we need the mechanism to write and read to distributed logs. To put another way, if we consider each record in the log a “message”, we need a way to produce and consume messages from a log. Again, there are options to choose from such as StreamSets, Spark Streaming, Kafka Streams, DynamoDB Streams, etc.
Finally, on top of some of these lower level implementation options for consuming and producing, there are frameworks such as Apache Beam or Apache Flink. These frameworks build data pipelines using existing distributed processing backends. For example, these may choose to utilize underlying technologies such as Spark Streaming.
On this site, we’ll deep dive into all these implementations examples and more.
For further reference, check out the tutorial sections of the site
Stream Processor Constructs
- Event time vs processing time
- Stream Processor Windows
- Failover and replication
Streaming Architecture will be interesting for years to come. Let’s have some fun exploring and learning.
Image credit https://pixabay.com/en/seljalandsfoss-waterfall-iceland-1751463/