Stream Processor Windows


When moving to stream processing architecture or building stream processors, you will soon face two choices.  Will you process streams on an individual, per event basis?  Or, will you collect and buffer multiple events/messages first, and then apply a function or join results to this collection of events?

Examples of single event processing might be the current GPS location, temperature, removing PII or enriching a record with address information.

Conversely, examples of the processing multiple events, which involves computing results across multiple events, including weblog traffic for breakout pages clicked during a session in order to make a recommendation, metrics from mobile or IoT devices such temperature readings over the last 5 minutes, fraud detection based on events over a set time period, and behavior analysis of what is added and removed from a shopping cart while visiting a particular website.  This is the opposite of single event processing where any associated context of the event is irrelevant.

But, back to the idea of processing events as a group of events rather than individually.  This implies stream processor implementations provide the capability of 1) gathering multiple events and 2) performing some kind of computation function on the collection of events.  Different implementations provide variances around these two fundamental constructs.  This high-level capability when designing stream processors is called window-based operations or more succinctly, windowing.

Table of Contents

Stream Processor Windows Overview

Window operations define boundaries to create finite sets of events and then perform functions against the set of events.  Sometimes these sets are called segments or collections or buckets.  The assignment of events into buckets is based on time or properties inherent in the event data.  Critical concept: There can be two notions of “time”: event time and processing time.

Event time is when the event occurred vs processing time which is the time when even was processed.  Do not overlook the differences between event and processing time.  Ideally, these two values are the same, but that’s not reality and the time will have different degrees of skew over time.  For a great explanation of these differences and how skew occurs, check out the Streaming 101 post.

See also  Streaming Data Engineer Use Cases

Windowing needs set boundaries for bucketing and how often the window produces a result.  Result functions may be aggregated, such as the total time spent in a particular location or other common functions such as max, min, mean, median, standard deviation, etc.

Types of Stream Processor Windows

There are four types of windowing.  Some of the most commonly used windowing methods implemented in streaming engines include tumbling, hopping sliding and session windows.

Tumbling Windows

Tumbling windows segment events into fixed sizes and then perform a function against the events in the collection.  Tumbling window segmentation may be based on a particular count of elements or a set period of time.   A key differentiator in Tumbling windows is there is no overlap between windows.  In other words, unlike other types of windows, an event can only be part of one particular window.  This differs from Hopping windows, which is the next type of window described.

Hopping Windows

Hopping windows are based on fixed time period intervals.  Hoping window results may overlap with other windows, so events may belong to more than one window processing result.  Hopping windows are defined by window time size (e.g. 5 minutes) and the advance interval or “hop” (e.g. 1 minute).   A Hopping window may be configured to be the same as a Tumbling window if the hop size to be the same as the window time size.  This will result in no overlaps and thus, all events will be part of only one bucket.

Sliding Windows

From my research, sliding vs tumbling windows is a bit of a debate.

See also  Streaming Analytics in 2023 - What, Why, and How

Sliding windows produce an output result only when an event occurs which is different than Tumbling or Hopping windows.  A good graphic comparing and contrasting Sliding vs Tumbling windows can be seen here.

In Kafka Streams, sliding windows are used only for `join` operations.  There doesn’t appear to be this distinction in Spark Streaming.

Similarities to previously described Tumbling and Hopping windows include the notion an event might belong to multiple windows and windows are defined by size and hop size.

Session Windows

Grouping of events originating from the same period of user activity or session is commonly used for behavior analysis.  User activities or “sessions” can be collected using a session window.  A session window starts when the first event occurs and remains open until a timeout value is reached. If another event occurs within the specified timeout from the previously ingested event, the session window extends to ingest the new event. If no events occur within the timeout, the session window is closed.

A common example of a session is a user browsing a website.  A user may enter a search term, compare different product pages, add one or more items to their shopping cart and eventually checkout or abandon their session without purchase.

Windowing Conclusion and Resources

You will face a variety of options when considering streaming applications and stream processing architectures.   If you need to process and produce results from a bucket of events rather than processing one event at a time, you will need stream processor windows.  Depending on your streaming implementation, you will have variances in what window operations are available to utilize.   The following links may help in learning more about stream processing windows from different perspectives.

See also  Schema Registry in Data Streaming [Options, Choices, Comparisons]

Stream Processor Window Examples

Kafka Stream Joins has examples of setting windows.

Kafka Streams Windowing

Spark Streaming Windowing

Flink Windowing

Beam Windowing

About Todd M

Todd has held multiple software roles over his 20 year career. For the last 5 years, he has focused on helping organizations move from batch to data streaming. In addition to the free tutorials, he provides consulting, coaching for Data Engineers, Data Scientists, and Data Architects. Feel free to reach out directly or to connect on LinkedIn

Leave a Comment