Stream Processor Windows

When moving to a stream processing architecture or building stream processors, you will soon face a fundamental choice.  Will you process streams on an individual, per-event basis?  Or will you collect and buffer multiple events/messages first, and then apply a function or join across this collection of events?

Examples of single-event processing include reporting a current GPS location or temperature reading, removing PII from a record, or enriching a record with address information.

Conversely, processing multiple events means computing results across a collection of events: weblog traffic for pages clicked during a session in order to make a recommendation, metrics from mobile or IoT devices such as temperature readings over the last 5 minutes, fraud detection based on events over a set time period, or analysis of what is added to and removed from a shopping cart while visiting a particular website.  This is the opposite of single-event processing, where any associated context of the event is irrelevant.

But, back to the idea of processing events as a group rather than individually.  This implies stream processor implementations must provide the capability of 1) gathering multiple events and 2) performing some kind of computation over the collection of events.  Different implementations vary in how they expose these two fundamental constructs.  This high-level capability in stream processor design is called window-based operations or, more succinctly, windowing.

Stream Processor Windows Overview

Window operations define boundaries to create finite sets of events and then perform functions against each set.  These sets are sometimes called segments, collections, or buckets.  Events are assigned to buckets based on time or on properties inherent in the event data.  Critical concept: there are two notions of “time”: event time and processing time.

Event time is when the event occurred, while processing time is when the event was processed.  Do not overlook the difference between event and processing time.  Ideally, these two values are the same, but in reality they drift apart, with varying degrees of skew over time.  For a great explanation of these differences and how skew occurs, check out the Streaming 101 post.
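
As a concrete illustration, Kafka Streams lets you supply a custom `TimestampExtractor` so windows are driven by event time taken from the message payload rather than by processing time.  The following is a minimal sketch, assuming values were already deserialized into a `Map` containing a hypothetical `eventTime` epoch-millis field; exact interfaces may differ slightly between Kafka Streams versions:

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Pulls event time out of the message payload instead of using the time the
// record arrives at the processor (processing time). Assumes values were
// deserialized into a Map holding a hypothetical "eventTime" epoch-millis field.
public class EventTimeExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Map) {
            Object eventTime = ((Map<?, ?>) value).get("eventTime");
            if (eventTime instanceof Number) {
                return ((Number) eventTime).longValue();
            }
        }
        // No usable event time in the payload: fall back to the stream's partition time
        return partitionTime;
    }
}
```

Such an extractor would typically be registered through the `default.timestamp.extractor` streams configuration property.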

Windowing requires defining the boundaries used for bucketing and how often the window produces a result.  Result functions are often aggregations, such as the total time spent in a particular location, or other common functions such as max, min, mean, median, and standard deviation.

Types of Stream Processor Windows

There are four types of windowing commonly implemented in streaming engines: tumbling, hopping, sliding, and session windows.

Tumbling Windows

Tumbling windows segment events into fixed-size buckets and then perform a function against the events in each collection.  Tumbling window segmentation may be based on a particular count of elements or a set period of time.  A key differentiator of Tumbling windows is that there is no overlap between windows.  In other words, unlike other types of windows, an event can only be part of one particular window.  This differs from Hopping windows, which are described next.
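
As an illustration, here is a minimal Kafka Streams sketch of a time-based tumbling window that counts events per key in fixed, non-overlapping 5-minute buckets.  The topic name and serdes are assumptions, and the exact `TimeWindows` factory methods vary between Kafka Streams versions:

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class TumblingWindowExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Page-view events keyed by page id (hypothetical topic)
        KStream<String, String> pageViews =
                builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String()));

        // Fixed 5-minute buckets with no overlap: each event is counted in exactly one window
        KTable<Windowed<String>, Long> viewsPerWindow = pageViews
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
                .count();

        viewsPerWindow.toStream().foreach((windowedKey, count) ->
                System.out.printf("page=%s window=%s count=%d%n",
                        windowedKey.key(), windowedKey.window(), count));

        // Building the topology only; starting KafkaStreams with broker config is omitted here
        System.out.println(builder.build().describe());
    }
}
```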

Hopping Windows

Hopping windows are based on fixed time intervals.  Hopping window results may overlap with other windows, so events may belong to more than one window result.  Hopping windows are defined by a window size (e.g. 5 minutes) and an advance interval or “hop” (e.g. 1 minute).  A Hopping window behaves the same as a Tumbling window if the hop size is set equal to the window size; this results in no overlaps, so every event is part of only one bucket.
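
Continuing the same hypothetical Kafka Streams sketch, a hopping window only changes the window definition: the 5-minute window now advances every 1 minute, so each event can land in up to five overlapping windows:

```java
import java.time.Duration;

import org.apache.kafka.streams.kstream.TimeWindows;

public class HoppingWindowExample {

    // 5-minute windows that advance ("hop") every 1 minute, so windows overlap and a
    // single event may be counted in up to five windows. Setting the advance equal to
    // the window size makes this equivalent to a tumbling window.
    static final TimeWindows HOPPING =
            TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1));

    // Plugged into the same topology as the tumbling sketch:
    //   pageViews.groupByKey().windowedBy(HOPPING).count();
}
```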

Sliding Windows

From my research, sliding vs tumbling windows is a bit of a debate.

Sliding windows produce an output result only when an event occurs, which differs from Tumbling and Hopping windows.  A good graphic comparing and contrasting Sliding vs Tumbling windows can be seen here.

In Kafka Streams, sliding windows are used only for `join` operations.  There doesn’t appear to be this distinction in Spark Streaming.

Similarities to the previously described Tumbling and Hopping windows include the notion that an event might belong to multiple windows and that windows are defined by a window size and hop size.
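
For example, in Kafka Streams the sliding behavior appears in stream-stream joins via `JoinWindows`: two records are joined whenever their timestamps fall within the window interval of each other.  The topic names and value types in this sketch are hypothetical, and method names vary by Kafka Streams version:

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class SlidingJoinExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> clicks =
                builder.stream("ad-clicks", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> impressions =
                builder.stream("ad-impressions", Consumed.with(Serdes.String(), Serdes.String()));

        // Join a click with an impression for the same key whenever their timestamps
        // fall within 1 minute of each other -- the join window "slides" with each event.
        KStream<String, String> joined = clicks.join(
                impressions,
                (click, impression) -> click + " / " + impression,
                JoinWindows.of(Duration.ofMinutes(1)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        joined.foreach((key, value) -> System.out.println(key + " => " + value));

        // Building the topology only; starting KafkaStreams with broker config is omitted
        System.out.println(builder.build().describe());
    }
}
```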

Session Windows

Grouping events originating from the same period of user activity, or session, is commonly used for behavior analysis.  These user activities or “sessions” can be collected using a session window.  A session window starts when the first event occurs and remains open until a timeout value is reached. If another event occurs within the specified timeout of the previously ingested event, the session window extends to ingest the new event. If no events occur within the timeout, the session window is closed.

A common example of a session is a user browsing a website.  A user may enter a search term, compare different product pages, add one or more items to their shopping cart, and eventually check out or abandon their session without purchase.
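
A minimal Kafka Streams sketch of this browsing-session example might count events per user session with a 30-minute inactivity gap.  The topic name, serdes, and gap value are assumptions:

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class SessionWindowExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Clickstream events keyed by user id (hypothetical topic)
        KStream<String, String> clicks =
                builder.stream("user-clicks", Consumed.with(Serdes.String(), Serdes.String()));

        // A session stays open as long as events keep arriving within 30 minutes of the
        // previous event; 30 minutes of inactivity closes the session window.
        KTable<Windowed<String>, Long> clicksPerSession = clicks
                .groupByKey()
                .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))
                .count();

        clicksPerSession.toStream()
                // Merged sessions emit tombstones (null counts); skip them for this sketch
                .filter((windowedUser, count) -> count != null)
                .foreach((windowedUser, count) ->
                        System.out.printf("user=%s sessionStart=%d sessionEnd=%d events=%d%n",
                                windowedUser.key(),
                                windowedUser.window().start(),
                                windowedUser.window().end(),
                                count));

        System.out.println(builder.build().describe());
    }
}
```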

Windowing Conclusion and Resources

You will face a variety of options when considering streaming applications and stream processing architectures.  If you need to process and produce results from a bucket of events rather than processing one event at a time, you will need stream processor windows.  Depending on your streaming implementation, the available window operations will vary.  The following links may help in learning more about stream processing windows from different perspectives.

Stream Processor Window Examples

Kafka Stream Joins has examples of setting windows.

Kafka Streams Windowing

Spark Streaming Windowing

Flink Windowing

Beam Windowing

Change Data Capture – What, Why, How

Change Data Capture can be a lightweight mechanism to capture the changes in databases so they may be processed someplace other than the database or application(s) which made the change. 

But, why?  Why would we want this? 

Let’s cover common questions and concerns around Change Data Capture or CDC in this post, so we have context for the specific CDC tutorials available on this site.

Also, in addition to “Why CDC?”, let’s provide insight into other common CDC questions such as “What is CDC?” and “How do we implement CDC?”, and list the pros and cons of CDC.  Let’s start with Why.

Why CDC?

Let’s consider “Why CDC?” at a high level first before we get into more of the details.  First, if data can be processed into “value” without any integration with other data, there is simply no need for CDC.  Let’s consider “value” to mean outcomes such as enabling better decisions, creating more sales, improving logistics costs, etc.  If we can realize this value in one application with one database, for example, there is no need for CDC.

But, what do we do if this “value” we are seeking can only be realized if data is combined from multiple sources?  For example, how do we integrate data from an E-commerce application and our inventory databases?  As a Data Engineer, how do you implement this?  As you already know, you have many options, such as traditional Extract Transform Load (ETL) or more modern Extract Load Transform (ELT) approaches.

What is CDC?

CDC is utilized to extract and transport changes in online transactional processing (OLTP) data systems to downstream systems such as an Event Log, data lakes, and/or stream processors.  For example, our CDC architecture might resemble this diagram.

 

Change Data Capture Architecture Diagram example

In essence, CDC is implemented in databases by writing to immutable transaction logs and then providing a mechanism to read from these logs. These transaction logs were not designed for CDC; they are intended primarily to address resiliency and performance concerns within a particular database, and CDC is more of an unexpected benefit.  But these logs also provide a mechanism by which a CDC process can read transactions with minimal performance impact on the transactional database.

  • Turning on and configuring CDC also requires no change to existing schema or applications using the database.
  • In addition to data mutations, CDC transaction logs can also capture changes in structure.  In other words, a database transaction log can capture both DML and DDL.

The alternative to reading from a transaction log is reading the database tables directly.  The tables need indicators to help determine which data has been created or updated.  These indicators are typically implemented as audit columns, such as dateCreated to indicate inserts and lastUpdated to flag updates.
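
A rough sketch of that table-polling alternative might look like the following, assuming a hypothetical `orders` table with a `lastUpdated` audit column and a standard JDBC driver on the classpath:

```java
import java.sql.*;
import java.time.Instant;

public class AuditColumnPoller {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string; any JDBC-accessible database works the same way
        String url = "jdbc:postgresql://localhost:5432/shop";
        Timestamp lastSeen = Timestamp.from(Instant.EPOCH);

        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            // Poll for rows created or updated since the last run, using the audit column.
            // Unlike log-based CDC, this misses deletes and intermediate updates,
            // and every poll adds query load to the transactional database.
            String sql = "SELECT id, status, lastUpdated FROM orders "
                       + "WHERE lastUpdated > ? ORDER BY lastUpdated";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("changed row id=%d status=%s at %s%n",
                                rs.getLong("id"), rs.getString("status"),
                                rs.getTimestamp("lastUpdated"));
                        lastSeen = rs.getTimestamp("lastUpdated");  // high-water mark for the next poll
                    }
                }
            }
        }
    }
}
```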

Let’s consider a high-level example.  In an E-commerce application, new inserts, updates, or deletes to transactions in the E-commerce database could be consumed via CDC and sent to an Event Log.  From here, the Event Log could be a source for Stream Processors to perform valuable near real-time computations such as fraud detection, recommendations, and alerts, or the Event Log could simply act as a data buffer before publishing to downstream analytic systems.
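
For instance, if the CDC events for orders land on an Event Log topic, a small stream processor could flag suspiciously large orders in near real time.  In this hedged sketch the topic names, the JSON payload shape, and the threshold are purely illustrative:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class OrderAlertExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // CDC change events from the e-commerce database, published to the Event Log
        KStream<String, String> orderChanges =
                builder.stream("ecommerce.orders.changes", Consumed.with(Serdes.String(), Serdes.String()));

        // Forward only change events with a large order total to an alerts topic
        // for downstream fraud review.
        orderChanges
                .filter((orderId, changeJson) -> changeJson != null && extractTotal(changeJson) > 10_000.0)
                .to("fraud-alerts", Produced.with(Serdes.String(), Serdes.String()));
    }

    // Crude extraction of the "total" field from the change event payload;
    // a real implementation would deserialize the CDC event properly.
    static double extractTotal(String json) {
        int i = json.indexOf("\"total\":");
        if (i < 0) return 0.0;
        int end = json.indexOf(',', i + 8);
        if (end < 0) end = json.indexOf('}', i + 8);
        try {
            return Double.parseDouble(json.substring(i + 8, end).trim());
        } catch (Exception e) {
            return 0.0;
        }
    }
}
```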

For more information on Event Logs, see the Why Event Logs? post.

Change Data Capture Vendor Examples

Change Data Capture is freely available out-of-the-box from database vendors such as Microsoft SQL Server, Oracle, PostgreSQL, and MySQL.

In Microsoft SQL Server, CDC records insert, update, and delete activity on a SQL Server table in a detailed format [1]. Column information, along with metadata, is captured for the modified rows and stored in append-only change tables. Table-valued functions give users systematic access to these change tables.  Records within the change tables are immutable, providing value similar to an immutable log.
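
As a sketch of what consuming those change tables can look like, the following queries a hypothetical `dbo.orders` capture instance over JDBC, using SQL Server's `sys.fn_cdc_get_min_lsn`/`sys.fn_cdc_get_max_lsn` helpers and the generated `cdc.fn_cdc_get_all_changes_<capture_instance>` table-valued function (connection details and column names are assumptions):

```java
import java.sql.*;

public class SqlServerCdcReader {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details
        String url = "jdbc:sqlserver://localhost:1433;databaseName=shop";

        // Read every change recorded for the dbo_orders capture instance between the
        // minimum and maximum log sequence numbers (LSNs) currently available.
        String sql =
            "DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_orders'); " +
            "DECLARE @to_lsn binary(10) = sys.fn_cdc_get_max_lsn(); " +
            "SELECT __$operation, id, status " +
            "FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, 'all');";

        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement stmt = conn.createStatement()) {

            // The batch starts with DECLAREs, so walk forward to the statement that returns rows
            boolean isResultSet = stmt.execute(sql);
            while (!isResultSet && stmt.getUpdateCount() != -1) {
                isResultSet = stmt.getMoreResults();
            }

            try (ResultSet rs = stmt.getResultSet()) {
                while (rs.next()) {
                    // __$operation: 1 = delete, 2 = insert, 4 = update (after image)
                    System.out.printf("op=%d id=%d status=%s%n",
                            rs.getInt("__$operation"), rs.getLong("id"), rs.getString("status"));
                }
            }
        }
    }
}
```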

In Oracle, it is possible to capture and publish changed data in synchronous and asynchronous modes [2]. In synchronous mode, change data is captured as part of the transaction that modifies the source table. This mode uses triggers on the source database to capture change data. Change data is captured in real-time on the source database. SYNC_SOURCE is a single, predefined synchronous change source that cannot be altered. Synchronous mode is cost-efficient, though it adds overhead to the source database at capture time.  In asynchronous mode, change data is captured from the database redo log files after changes have been made to the source database.

In MySQL, CDC is available via the `binlog`, which was originally used for auditing and for replicating data to other MySQL systems [3]. But the `binlog` may also be utilized outside of MySQL, for example to process CDC events and save them to downstream analytics systems.

In PostgreSQL, change data capture is possible through either transaction logs or triggers. With transaction logs, all write transactions (i.e., INSERT, UPDATE, DELETE, and DDL statements) are written to the Write-Ahead Log (WAL) before the transaction result is sent to the user or client [4].  The WAL allows the consumption of CDC events.
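
As a quick illustration, PostgreSQL's logical decoding can be exercised with plain SQL functions over a replication slot.  The following JDBC sketch uses the built-in `test_decoding` output plugin; it requires `wal_level = logical`, and the slot name, credentials, and database are assumptions:

```java
import java.sql.*;

public class PostgresWalPeek {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details; the database must run with wal_level = logical
        String url = "jdbc:postgresql://localhost:5432/shop";

        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             Statement stmt = conn.createStatement()) {

            // Create a logical replication slot that decodes WAL entries into readable text
            stmt.execute(
                "SELECT * FROM pg_create_logical_replication_slot('cdc_demo', 'test_decoding')");

            // ... INSERT/UPDATE/DELETE activity happens on the database here ...

            // Consume (and clear) the decoded change events waiting in the slot
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL)")) {
                while (rs.next()) {
                    System.out.printf("lsn=%s xid=%s change=%s%n",
                            rs.getString("lsn"), rs.getString("xid"), rs.getString("data"));
                }
            }
        }
    }
}
```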

In MongoDB, the transaction log is called the `opLog`.

I’m sure you get the idea by now.

When to Consider Change Data Capture?

Change data capture plays a significant role in streaming data processing and pipelines. As the amount of data grows rapidly, the need for CDC techniques becomes crucial to handle data inflow for analytics and near real-time analytics such as machine learning and artificial intelligence (AI).

Problems arise when long-running, taxing analytic queries are introduced on OLTP systems, which affects overall application performance. CDC-based technologies allow users to automatically capture database mutations such as inserts, updates, and deletes, as well as structural changes such as DDL `alter table` statements.  This provides the flexibility to process changes in streaming applications and/or store them in destinations better suited for analytic queries.

CDC implementations contain the metadata necessary to understand the changes made. CDC technologies reduce cost, improve data quality and accuracy, and provide a mechanism for creating streaming architectures. CDC is a solution for continuous and accelerating growth in data volumes, reducing load times, resources, and cost.

Change data capture summary list of advantages:

    • CDC implementation does not require application code changes
    • Only requires configuration changes to the database
    • Enables us to identify a change history
    • Allows the user to add context information to every DML data mutation if required
    • CDC has an auto cleanup feature, deleting information automatically based on the retention period

Change Data Capture has its disadvantages:

    • CDC may not track change time
    • May not track the security context of the change
    • Does not capture how the data changed; it only records that a change was made
    • May add slight overhead to the system, depending on the number of changes

Change Data Capture Options

Once you have configured CDC in Oracle, SQL Server, etc., you may be wondering how you can consume and process the CDC events.  How can you implement a streaming architecture with CDC?  What are your options for building CDC stream processors?

At the time of this writing, here are a few options to consider, not listed in any particular order:

  1. Debezium – Open Source.  Built on the Kafka Connect framework. All of Debezium’s connectors are Kafka Connect source connectors, so they have the pros and cons associated with Kafka Connect.
  2. StreamSets – Open Source.  Out of the box support for all CDC implementations described here as well as others.
  3. Others?  Let us know.

Also, you may consider vendor-specific options such as:

  1. Oracle Golden Gate
  2. DynamoDB Streams
  3. Attunity

Conclusion

Change data capture is available out-of-the-box in many database systems such as MS SQL Server, Oracle, MySQL, Postgres, and MongoDB.  CDC is prevalent in streaming architectures when implementing separation of concerns between transactions and other concerns such as analytics, search indexing, machine learning, AI, near real-time monitoring, and alerting.

Hope this helps!

Let us know if you have any questions or concerns.

References

[1] SQL Server https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

[2] Oracle CDC https://docs.oracle.com/cd/E11882_01/server.112/e25554/cdc.htm

[3] MySQL binlog https://dev.mysql.com/doc/internals/en/binary-log-overview.html

[4] PostgreSQL WAL https://www.postgresql.org/docs/11/runtime-config-wal.html

 

 
