What and Why Event Logs for Data Engineers?

Why Event Logs diagram

“The idea of structuring data as a stream of events is nothing new, and it is used in many different
fields. Even though the underlying principles are often similar, the terminology is frequently
inconsistent across different fields, which can be quite confusing. Although the jargon can be
intimidating when you first encounter it, don’t let that put you off; many of the ideas are
quite simple when you get down to the core.” –Martin Kleppmann

And so begins Chapter 1 in the book “Making Sense of Stream Processing” by Martin Kleppmann.

This book can be found for free from numerous sources, and it is the most efficient way I know to understand why we need Event Logs.

This amazingly succinct ~170-page book is the best resource I’ve found for answering the questions I had, such as “Why do we need Event Logs?”, “How do we utilize Event Logs when designing Real-time Applications?”, “Are there different kinds of Event Logs?”, “Are Event Logs the same as Message Queues or Message Buses?”, etc.

A Personal Perspective on Event Logs

As a personal note, when I read this book the first time, I was challenged in designing a system that could handle a fairly large number of concurrent health-related measurement transactions.  Think measurements such as weight, food intake, exercise completed, etc.  On one hand, the system needed to handle these transactions as quickly and efficiently as possible.  On the other hand, the system needed to aggregate and filter these transactions into composite views such as “How many individuals, with one or more kinds of certain attributes, have recorded their weight in the last 30 minutes?”  For example, “How many individuals who live in Kentucky and work at a company called Bourbon, Inc. have eaten more than 2,000 calories today?”

Answering these types of questions by querying the database where these transactions were recorded wasn’t fast enough.  Extracting this transactional data from the operational data store and loading it into an analytic store wasn’t fast enough either.

But, back to the book.  This book completely opened my eyes to ways I could have designed the previously mentioned architecture.  If you have struggled with anything similar, then this book is the place to start.  As mentioned, it can be found for free.  Go get it and read it.  Heck, you only have to skim Chapters 1 and 2 to get value and determine if you want to go deeper.  Chapter 3 provides more information on Change Data Capture, which we covered here before.

Types of Event Logs?

Now, I know some of you are asking: Are Event Logs and Message Queues the same thing?  What about the differences between an Event Log and a Message Bus?  Is he thinking of something like Apache Kafka, Amazon Kinesis, Google Pub/Sub, or Azure Event Hubs?

If you are asking these questions, the short answer is yes-and-no.  For this post, the differences are not as important as the concept of an Event Log.  Let’s cover the differences in other posts.  If there are any particular areas you’d like me to cover, let me know in the comments below.

Even the “Making Sense of Stream Processing” book deflects questions such as these when Martin writes, “If you want to get a bit more sophisticated, you can introduce an event stream, or a message queue, or an event log (or whatever you want to call it).”

As we get deeper into Data Engineering use cases, we can cover more of the differences between Event Logs in other places on this site.  For example, keep this in mind from the book’s Foreword.

“Whenever people are excited about an idea or technology, they come up with buzzwords to describe it. Perhaps you have come across some of the following terms, and wondered what they are about: “stream processing”, “event sourcing”, “CQRS”, “reactive”, and “complex event processing”.

Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies that want to sell you their solutions. But sometimes, they contain a kernel of wisdom that can really help us design better systems.” -Neha Narkhede

Where Event Logs?

For discussion purposes and to help engrain it in my brain as well, let’s explore some diagrams.

An Application Before Event Logs (Micro Level)

Without Event Log Diagram

Now, “application” (App) is intentionally left vague here.  At this point, it’s perfectly fine to consider this application to be a Web application, a Microservice, a Log Collection agent, or an IoT device.

An Application After Event Logs (again, Micro Level)

With Event Log Diagram
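
To make the “With Event Log” picture concrete, here is a minimal sketch in Python, assuming Apache Kafka as the Event Log, a broker on localhost, and a hypothetical “measurements” Topic (none of these appear in the diagram itself).  Instead of writing directly to the database, the App appends an immutable event to the log, and the database writer becomes just another consumer:

```python
# A minimal sketch, assuming Apache Kafka as the Event Log, a broker on
# localhost, and a hypothetical "measurements" Topic.
import json
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {
    "type": "WeightRecorded",
    "individual_id": "user-123",
    "weight_kg": 81.6,
    "recorded_at": "2020-03-01T14:05:00Z",
}

# Keying by individual keeps all of one person's events in the same
# partition, and therefore in order relative to each other.
producer.produce("measurements",
                 key=event["individual_id"],
                 value=json.dumps(event))
producer.flush()  # block until the broker acknowledges the append
```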

What are your initial reactions to these two diagrams now?

Does the “Without Event Log” diagram look faster and more straightforward?  The “With Event Log” diagram looks more complicated, and we’re not sure of the added benefits at this point, right?

There was a time when I couldn’t disagree with you, but then I faced the experience described above.  So, let’s take a look at what happens to these diagrams when we move to less isolated scenarios.  (I could have said, “let’s look at the big picture” here, but I didn’t feel like saying it that way, and this is my blog.  I get to call the shots around here.  I’m the Big blog bossman.)

Anyhow, let’s get back to the diagrams.

Over time, when more components require data integration, these diagrams evolve to:

Without Event Log

Without an Event Log 2 Diagram


With Event Log

With an Event Log 2 Diagram

Now, which looks better?

“With an Event Log 2” diagram looks cleaner to me.

But, let’s continue looking ahead to the inevitable time when our architecture must change.

For example, what if I want to perform calculations such as aggregation, categorization, filtering, or alerting as transaction Events occur within the system?  Well, with an Event Log in place and the integration components decoupled, we can add a Stream Processor.

Adaptability to change is another benefit of utilizing the Event Log.  In addition to adding Stream Processors, what happens if we want to integrate a SaaS application’s data down the line?  Well, we could simply hang it off our Event Log.

For example, notice Stream Processing and an additional SaaS component integration in the following diagram.

Event Logs Provide Flexibility Diagram
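
To make the decoupling concrete, here is a minimal sketch of “hanging a new component off the Event Log”, again in Python with Kafka assumed.  The SaaS integration is hypothetical; the point is that nothing about the existing producers or other consumers has to change:

```python
# A hypothetical SaaS-sync consumer; existing producers and consumers
# are untouched because each consumer group tracks its own position.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "saas-sync",          # a brand-new, independent group
    "auto.offset.reset": "earliest",  # start from the oldest retained event
})
consumer.subscribe(["measurements"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # forward_to_saas(event)  # hypothetical call to the SaaS API
    print("would sync:", event["type"], event["individual_id"])
```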

(By the way, Stream Processors are covered in more depth in a different post.)

At this point, I’ll assume you are visually sold on the concept of the Event Log, or at least interested in exploring more. Let’s list further considerations, such as Benefits, Disadvantages, Types, and Technical Differences.  (Let me know if you have something to add in the following lists, by the way.)

Benefits of Event Logs in your Architecture

  • Loosely Coupled Integration
  • Flexible, adaptable, and resilient to changes in architecture and requirements
  • Planning for Failure — some Event Logs are configurable to handle failures in nodes and networks gracefully
  • Events are replayable; e.g., replay history to retrain a model (see the sketch after this list)
  • Events are processed in a guaranteed order (per partition, in Kafka’s case)
  • Foundation for real-time processing (or as close as possible to it) with Stream Processors
  • Consistency
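
On the replayability point, here is a minimal sketch, assuming Kafka and the hypothetical “measurements” Topic from earlier.  Instead of subscribing with a group, we assign a partition explicitly and seek back to the beginning of the retained log, e.g., to rebuild training data for a model:

```python
# Replay sketch: read the "measurements" Topic from the beginning,
# e.g., to rebuild training data for a model. Assumes Kafka, as above.
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "model-retrain-replay",
})
# Assign the partition explicitly and start at the first retained offset.
consumer.assign([TopicPartition("measurements", 0, OFFSET_BEGINNING)])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break  # no more events within the timeout; treat as caught up
    if not msg.error():
        pass  # feed msg.value() into the training pipeline here
```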

Disadvantages of Event Logs

  • May seem more complicated and overkill at first
  • Likely introduces a change to your architecture and way of thinking, and change can be challenging

Types of Event Logs

  • Apache Kafka, Apache Pulsar
  • Message Queues (RabbitMQ, ActiveMQ, MQSeries, etc.)
  • Cloud-Native (Kinesis, Pub/Sub, Event Hubs, Confluent Cloud)

Technical Differences between Event Log implementations

There are technical trade-offs in your choice of Event Logs, including:

  • Producer / Consumer decoupling; i.e., multiple consumers of the same event at different points in time
  • Scalability; how many events can be processed, and how fast?
  • Replayability (the ability to replay events from a particular point in time)
  • Support for Transactions
  • Exactly Once Processing
  • Resiliency — the ability to set the replication factor of events and the level of acknowledgment required for appends to be considered successful (see the sketch after this list)
  • Out-of-the-box connectors for ingress and egress
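
As a minimal sketch of the resiliency knobs above, assuming Kafka: the replication factor is set when a topic is created, and the producer chooses how many acknowledgments an append needs before it is considered successful:

```python
# Resiliency sketch, assuming Kafka: replication factor is chosen at
# topic creation; acks control when an append counts as successful.
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
# Three copies of every event, spread across three brokers.
admin.create_topics([NewTopic("measurements",
                              num_partitions=3,
                              replication_factor=3)])

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "acks": "all",               # wait for all in-sync replicas
    "enable.idempotence": True,  # safe retries without duplicates
})
```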


Event Log Recommended Resources

  • Start with “Making Sense of Stream Processing” by Martin Kleppmann.  Search for it; you can find it for free from multiple sources at the time of this writing.
  • Then, move to Martin’s next book “Designing Data-Intensive Applications”


Featured Image https://pixabay.com/photos/batch-dry-firewood-forestry-logs-1868104/

Stream Processing

Event Stream Processing

We choose Stream Processing as a way to process data more quickly than traditional approaches.  But, how do we do Stream Processing?

Is Stream Processing different from Event Stream Processing?  Why do we need it?  What are a few examples of event streaming patterns?  How do we implement it?

Let’s get into these questions.

As you’ve probably noticed, the topic of stream processing is becoming popular, but the idea itself isn’t new at all [1].  The ways we can implement event stream processing vary, but I believe the intention is fundamentally the same.  We use Event Stream Processing to perform real-time computations on data as it arrives, is changed, or is deleted.  Therefore, the purpose of Event Stream Processing is simple.  It facilitates the real-time (or as close as we can get to real-time) processing of our data to produce faster results.  It is the answer to “Why do we need stream processing?”

Why do we need stream processing?

One way to approach the answer to this question is to consider businesses. Businesses constantly want more.  They want to grow or they want to stay the same.  If they want to grow, they need to change.  If they want to stay the same, the world changes around them and they likely need to adapt to it and that means change.  It doesn’t matter.  Let’s just go with the premise that businesses constantly want more and need to change and evolve to survive.

One area where businesses want more is data processing.  They want to process data faster to make money, save money or mitigate possible risks.  Again, it’s simple.  One way to process data faster is to implement stream processing.

What is not stream processing?

Before we attempt to answer “what is stream processing”, let’s define what it isn’t.  Sound ok?  Well, I hope so.  Here goes…

In a galaxy far, far away, businesses processed data after it landed at various locations to produce a result.  This result was computed much later than when the data originally “landed”.  The canonical example is the “daily report” or the “dashboard”.  Consider a daily report or dashboard that is created once per day and displays the previous business day’s sales from a particular region.  Or similarly, consider determining the probability that a credit card transaction was fraudulent, 5 days after the transaction(s) occurred.  Another example is determining that a product is out of stock one day after the order was placed.

There are common threads in these examples.  One, the determination of the result (daily report, fraud probability score, product out-of-stock) is produced much later than when the event data used in the calculation occurred.  Businesses want it to happen faster without any quality trade-offs.

Another common thread is how these determinations are implemented in software.  The implementation described in these examples is referred to as “batch processing”.  Why is it called “batch”?  Because these results are calculated by processing a group, or batch, of data at the same time.  To create a daily report, a group of the previous day’s transactions is used at 12:30 AM each day.  In the batch processor, multiple transactions are used from the beginning to the end of the processing.  To determine possible fraud, a group of the previous 5 days’ worth of transactions is used in the batch process.  While processing a group of the previous 8 hours of order transactions (in batch), one particular order is determined to be unavailable in the current inventory.

Wouldn’t it be faster to produce the desired results as the data is created?  The answer is Yes.  Yes, the report or dashboard results could be updated sooner if they were updated as each order event occurred.  Yes, the fraud probability determination would be much quicker if the most recent transaction event was processed immediately after it happened.  And finally, yes, the out-of-inventory condition could be determined much more quickly if the associated order event was processed at the time it was attempted.

As you’ve probably guessed, the emphasis on the word “event” above is intentional.  Here is a spoiler alert.  You need to process events in stream processing.  That’s why stream processing is often referred to as “event stream processing”.

Stream Processing is not Batch Processing.
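
Here is a toy contrast between the two styles, in plain Python (the numbers are made up).  The batch version recomputes the daily sales total from all stored orders once per day; the streaming version maintains the same answer incrementally as each order event arrives:

```python
# Toy contrast in plain Python (numbers are made up). The batch job
# recomputes the total from all stored orders once per day; the
# streaming version keeps the same answer current per event.
orders = [
    {"region": "east", "amount": 40.0},
    {"region": "east", "amount": 25.0},
]

def daily_report(all_orders):
    # Batch: runs at 12:30 AM over yesterday's accumulated data.
    return sum(o["amount"] for o in all_orders)

running_total = 0.0
def on_order_event(event):
    # Streaming: update running state as each event arrives.
    global running_total
    running_total += event["amount"]

for event in orders:  # stand-in for an arriving event stream
    on_order_event(event)

assert daily_report(orders) == running_total == 65.0
```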

What is Stream Processing?

Let’s take a look at a couple of diagrams.  Here’s an example of how Stream Processing may fit into an existing architecture.


Event Stream Processing diagram example

Origin events are created by applications such as an E-commerce site or a green-screen terminal.  Events are captured and stored in various formats, including RDBMS, NoSQL, Mainframes, and Log files.  Next, these events (inserts, updates, deletes, appends to a log file, etc.) are written into an Event Log.  Now, our Stream Processor(s) can be invoked to consume data from one or more Topics in the Event Log and process these events.  Processing examples include calculating counts over a specified time range, filtering, aggregating, and even joining multiple streams of Topic events to enrich into new Topics.  There are many tutorial examples of Stream Processors on this site.
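
As a minimal sketch of that middle step, assuming Kafka and two hypothetical Topics: the processor below consumes raw measurement events, enriches each one, and produces the result to a new Topic.  Real frameworks (Kafka Streams, Flink, etc.) add windows, state, and fault tolerance on top of this basic consume-transform-produce loop:

```python
# Consume-transform-produce sketch, assuming Kafka and two hypothetical
# Topics: raw events in, enriched events out to a new Topic.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["measurements"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # The "enrichment" here is deliberately trivial.
    event["weight_lb"] = round(event["weight_kg"] * 2.20462, 1)
    producer.produce("measurements-enriched",
                     key=msg.key(),
                     value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks / flush buffers
```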

By the way, for more information, see the post on Event Logs.

The example diagram above is ideal when implementing Stream Processing into existing architectures.  The “Origin Events Recaptured” components, the Event Log, and the Stream Processors can be implemented and evolved incrementally, which is beneficial for numerous reasons.

What if we are creating a brand-new, greenfield application?  And what if this application includes Microservices?  Stream Processors can fit nicely here as well as shown in the following diagram.

Event Stream Processing with Microservices Architecture Diagram


As you can see, Stream Processors can be implemented on top of the Event Log used in Microservice architectures as well.  It is important to note the changes to the arrows in the Microservice example above.  Notice how the arrows indicate data flow both To and From the Event Log now.  In this example, Stream Processors are likely creating curated or enriched data Topics that are being consumed by various Microservices and/or landed in some downstream storage such as HDFS or an object store like S3 or some type of Data Lake.

What do most businesses’ software architectures look like today?

If you could combine both of the diagrams above into one, that is what I see as most common.  Also, there are variations of these diagrams when implementing multiple data center solutions such as Hybrid Cloud.

As shown above, one key decision in stream processing is determining whether or not you will implement an Event Log.  Because I hear some of you, and yes, I guess we could build stream processors without an Event Log, but why?  Seriously, let us know in the comments below.

The Event Log will be a distributed, ordered, append-only log for all your streaming architectures.  The Event Log is our linchpin for building distributed, flexible streaming systems, and there are various options available.  But again, the log is nothing new; the concept isn’t new.  Many RDBMSs use changelogs (or WALs, “write-ahead logs”) to improve performance, provide point-in-time recovery (PITR) hooks for disaster recovery, and, finally, serve as the foundation for distributed replication.

In most cases, I believe you’ll benefit from choosing to utilize a distributed event log, but there are cases where you may wish to stream directly and checkpoint along the way.  We’ll cover the Event Log in more detail in other areas of this site.
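
To strip the idea to its core, here is a toy append-only log in plain Python (an illustration of the concept only, not how Kafka or any real log is implemented): appends go at the end, records are immutable, and any reader can replay from any offset at any time:

```python
# A toy append-only log (an illustration of the idea, not how Kafka is
# implemented): appends go at the end, records are immutable, and any
# reader can replay from any offset at any time.
class EventLog:
    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record to the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, from_offset: int = 0):
        """Replay records in order, starting at from_offset."""
        return iter(self._records[from_offset:])

log = EventLog()
log.append({"type": "WeightRecorded", "weight_kg": 81.6})
log.append({"type": "MealLogged", "calories": 700})
assert [r["type"] for r in log.read()] == ["WeightRecorded", "MealLogged"]
```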

Can we Debate what a Stream Processor is Now?

These are just my opinions, and I’m curious if you have something to add here.  Do you see stream processing as something else?  Let me know.  As you can tell from the above, I consider there to be two requirements: an Event Log and a Stream Processing framework or application.  When designing stream processors, I write the stream processor’s computed results back to the Event Log rather than directly writing the results to a data lake, object store, or database.  If sinking the stream processor results is required, do it someplace else outside of the stream processor itself; e.g., use a Kafka Connector sink to write to S3, Azure Blob Storage, Cassandra, HDFS, BigQuery, Snowflake, etc. (see the sketch below).
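
For example, here is a hedged sketch of sinking outside the stream processor: registering Confluent’s S3 sink connector over the Kafka Connect REST API.  The Connect URL, bucket, region, and topic names are assumptions for illustration:

```python
# Registering Confluent's S3 sink connector over the Kafka Connect REST
# API. The Connect URL, bucket, region, and topic are assumptions.
import json
import urllib.request

connector = {
    "name": "measurements-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "measurements-enriched",
        "s3.bucket.name": "my-event-archive",  # hypothetical bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",  # write an object every 1000 records
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",  # default Connect REST port
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```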

We have options for the implementation of our Event Log and Stream Processor compute.

Event Logs are most likely implemented in Apache Kafka, but the approach popularized by Apache Kafka has been adopted by other options as well.

For Stream Processors, our options include Kafka Streams, ksqlDB, Spark Streaming, Pulsar Functions, StreamSets, NiFi, Beam, DynamoDB Streams, Databricks, Flink, Storm, Samza, and Google Cloud Dataflow.  What else?  What am I missing?  Let me know in the comments below.

On this site, we’ll deep-dive into all of these implementation examples and more.

For now, check out the tutorial sections of the site.

Regardless of your implementation, there are certain constructs or patterns in stream processing which we should know or at minimum, be aware of for future reference.

Stream Processor Constructs

  • Ordering – do we need to process each event in strict order, or is it perfectly fine to compute out of order in some cases?
  • Event time vs. processing time – the time an event occurred often differs from the time it is processed, and we must choose which one our computations use
  • Stream Processor Windows – in concert with the meaning of time, perform calculations over bounded slices of the stream, such as sums, averages, and max/min.  An example could be calculating the average temperature every 5 minutes or the average stock price over the last 10 minutes (see the sketch after this list)
  • Failover and replication
  • Joins – combining data from multiple sources to infer from them or enrich them into a new event
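
As a minimal sketch of event-time tumbling windows, in plain Python: events are bucketed by when they occurred (event time), not by when we happen to process them, so a late-arriving reading still lands in the 5-minute window of its occurrence:

```python
# A 5-minute tumbling window keyed by event time, in plain Python.
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000
windows = defaultdict(list)  # window start (ms) -> temperatures

def on_event(event_time_ms: int, temperature: float):
    # Bucket by when the event occurred, not when it was processed.
    window_start = (event_time_ms // WINDOW_MS) * WINDOW_MS
    windows[window_start].append(temperature)

# Two readings that occurred in the same 5-minute window aggregate
# together, even if the second one reaches us much later.
on_event(1_583_064_000_000, 20.5)
on_event(1_583_064_299_999, 21.5)

for start, temps in sorted(windows.items()):
    print(start, sum(temps) / len(temps))  # -> one window, average 21.0
```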

Stream Processing Vertical Applications

Where are Stream Processing solutions most prevalent?  The short answer is everywhere.  Stream processors are beneficial in healthcare, finance, security, IoT, retail, etc., because, as mentioned in the beginning, every industry wants data to be faster.

Specific solutions that span multiple verticals include:

  • Refined, harmonized, filtered streaming ETL – streaming from raw to filtered; transformations; anonymization and data protection
  • Real-Time Inventory, Recommendations, Fraud Detection, Alerts, Real-Time XYZ, etc.
  • Specialized – for example, stream data to search implementations such as Elastic or Solr
  • Machine Learning / AI – streaming data pipelines into training data repositories, as well as running events through machine learning model(s)
  • Monitoring and alerting
  • Hybrid or Multi-cloud

Conclusion

Streaming Architecture will be interesting for years to come.  Let’s have some fun exploring and learning.

Additional Resources

[1] The concept described above isn’t necessarily new, as you’ll see in the SQL example at https://en.wikipedia.org/wiki/Event_stream_processing.


Image credit https://pixabay.com/en/seljalandsfoss-waterfall-iceland-1751463/