As streaming data engineers, we face many data integration challenges, such as “How do we integrate this SaaS with this internal database?”, “Will a particular integration be real-time or batch?”, “How does the system we design recover from possible failures?” and “If anyone has ever addressed a situation similar to mine before, how did they do it?”. Data engineers may also have to buck certain trends, such as converting everything into a streaming workload when it may be better suited for batch anyhow. In any case, let’s have some fun and start exploring streaming data engineering at a high level and then dive into more specific use cases.
Please leave a comment below if I’m missing any.
Streaming Data Engineer Use Case Examples
- Streaming Data Engineer Overview
- Event-Driven Architecture (EDA)
- Messaging Systems: Event Logs or Message Queues
- Change Data Capture (CDC)
- Event Stream Processing
- Why does the world need Streaming Data Engineers?
- What is a Data Engineer?
- How do you become a Data Engineer?
Streaming Data Engineer Overview
Streaming Data Engineers or Data Architects are usually responsible for designing solutions to the previous example questions. Now, I admit, the titles of the people responsible for addressing these questions vary, but the intent is still the same. For example, we might be a Solution Engineer, Solution Architect, Cloud Architect or ETL Developer, just to name a few. Or, do you remember when it was more common to refer to back-end vs. front-end engineers?
The point is, “in the end, it doesn’t even matter”. That’s right, I just dropped in some lyrics from the song “In the End” by Linkin Park. Don’t even think about it, because it doesn’t matter.
One of the goals of this site is to describe these use cases and data engineering patterns. Then, with these use cases and patterns in mind, we can take the next step and provide implementation examples from our various technical options such as Spark, Kafka, etc. I hope you learn from this site because I know I learn from creating the content and from your comments and questions.
Event-Driven Architecture (EDA)
Event-Driven Architectures strive for loose coupling between components as opposed to tight coupling between components for data integration. When you consider a “component”, think of a database, monolith application, micro-service, 3rd party SaaS service, mainframe, log file, etc. Next, as you may gather from the name, event-driven architectures rely on the concept of an Event. The “Event” represents a change of state within your system; e.g. an order was placed, an order was shipped, a temperature reading at a particular time, a person started a fitness run, a person lifted a kettlebell, etc.
Events are appended to an Event Log or consumed from an Event Log. The Event Log is the foundation for loosely coupled integrations between components. In particular, the various components that append and/or consume Events from the Event Log are unaware of each other and thus loosely coupled. This offers tremendous flexibility, but also creates questions that must be answered.
For a trivial example, consider an E-Commerce Order processor that creates an Order Event and appends it to an Event Log. Later, this Order Event is consumed from the Event Log by a Shipping processor. This Shipping processor’s responsibility might include ensuring the customer receives their order. Another consuming processor, called the Customer Service processor, might consume the Order Event from the Event Log to send the customer an email thanking them for their order. In this example, the process of transacting an order is blissfully unaware of both the Shipping and Customer Service processors. They are loosely coupled today and also flexible to evolve in the future. For example, an Inventory processor might be added to perform actions when certain inventory threshold levels are reached.
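To make the order example a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration: the `EventLog` and `Consumer` classes, the `order_processor` function and the event fields. The log is just an in-memory list, not a real messaging system. The point is only that the order side never references the Shipping or Customer Service side.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Toy in-memory, append-only log (illustration only)."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just appended

    def read_from(self, offset: int) -> list:
        return self.events[offset:]


class Consumer:
    """Reads the log at its own pace and remembers its own offset."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.offset = 0
        self.output = []

    def poll(self, log: EventLog) -> None:
        for event in log.read_from(self.offset):
            self.output.append(self.handler(event))
            self.offset += 1


def order_processor(log: EventLog, order_id: int, customer: str) -> None:
    # The producer only knows about the log, not about who reads it.
    log.append({"type": "OrderPlaced", "order_id": order_id, "customer": customer})


log = EventLog()
shipping = Consumer("shipping", lambda e: f"ship order {e['order_id']}")
customer_service = Consumer("customer_service", lambda e: f"email thanks to {e['customer']}")

order_processor(log, 1001, "ana@example.com")
shipping.poll(log)
customer_service.poll(log)

print(shipping.output)
print(customer_service.output)
```

Adding an Inventory processor later would mean creating one more `Consumer`; `order_processor` would not change.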
There is much, much more to EDA and “it ain’t all flowers” as some folks say, but as an architect, flexible integration of loosely coupled components is a good place to start.
Messaging Systems: Event Logs or Message Queues
A required component in EDA is a Messaging System. There are typically two types of Messaging Systems: Event Logs and Message Queues. You might know these by different names, such as Message Bus, and I’m open to feedback as naming is important for clarity. I’m an old dog now, so I tend to see the similarity of intent or design in things even though they might be known by different names. Regardless of the name you use, there are fundamental differences between Event Logs and Message Queues, which we will cover in more depth.
Event Logs store immutable events. To record new events, Producers append them to the Event Log. To read events, Consumers are created to read from one or more Event Logs. Again, different terminology may be used here. For example, Consumers are also known as Subscribers, and Events can be called Messages. The meaning can differ based on context, so be careful.
Events can be organized by their attributes into Topics, which is analogous to how data can be segmented into tables within a database. In this way, Producers can be configured to write to certain Topics and Consumers can be configured to read from certain Topics.
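Here is a rough Python sketch of Topics, Producers and Consumers. The `TopicLog` class and its `produce` and `consume` methods are names I invented for this example and are not the API of Kafka or any other product. Two assumptions are baked in: events are stored as read-only mappings to mimic immutability, and reading never removes anything from the log.

```python
from types import MappingProxyType


class TopicLog:
    """Toy event log that keeps one append-only list per topic."""

    def __init__(self):
        self._topics = {}

    def produce(self, topic: str, payload: dict) -> int:
        # Store a read-only view of a copy so the event cannot be changed later.
        event = MappingProxyType(dict(payload))
        self._topics.setdefault(topic, []).append(event)
        return len(self._topics[topic]) - 1  # offset within the topic

    def consume(self, topic: str, offset: int = 0) -> list:
        # Reading returns events from the offset onward and removes nothing.
        return list(self._topics.get(topic, [])[offset:])


log = TopicLog()
log.produce("orders", {"order_id": 1, "region": "EU"})
log.produce("orders", {"order_id": 2, "region": "US"})
log.produce("shipments", {"order_id": 1})

first_read = log.consume("orders")             # both order events
second_read = log.consume("orders", offset=1)  # only the second order event
shipments = log.consume("shipments")           # a different topic entirely

print(len(first_read), len(second_read), len(shipments))
```

Because a read is just a slice from an offset, any number of Consumers can read the same Topic without affecting each other.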
For more, see What and Why Event Logs?
Change Data Capture (CDC)
Change Data Capture (CDC) tracks changes in source databases and transports these changes to downstream systems such as data warehouses, data lakes and/or stream processors. Usually, the source databases are online transactional processing (OLTP) data systems. For example, mutations in an E-commerce database such as inserts, updates or deletes would be consumed via CDC and transmitted downstream to an event log and/or analytic systems.
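To show the shape of what CDC emits, here is a small Python sketch. A caveat: it compares two snapshots of a table, which is a simplification I chose for readability. CDC tools commonly read the database’s transaction log instead. The `capture_changes` function and the `op`, `key`, `before` and `after` field names are hypothetical.

```python
def capture_changes(before: dict, after: dict) -> list:
    """Compare two snapshots keyed by primary key and return change events."""
    changes = []
    for key in after.keys() - before.keys():
        changes.append({"op": "insert", "key": key, "after": after[key]})
    for key in before.keys() - after.keys():
        changes.append({"op": "delete", "key": key, "before": before[key]})
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            changes.append(
                {"op": "update", "key": key, "before": before[key], "after": after[key]}
            )
    # Sort by key so the output order is deterministic.
    return sorted(changes, key=lambda change: change["key"])


orders_before = {1: {"status": "placed"}, 2: {"status": "placed"}}
orders_after = {1: {"status": "shipped"}, 3: {"status": "placed"}}

changes = capture_changes(orders_before, orders_after)
for change in changes:
    print(change["op"], change["key"])
```

Each change event could then be appended to an event log for downstream consumers.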
For more, see Change Data Capture, What is It and Why you May Want It?
Event Stream Processing
Stream Processing consumes data from an event log, performs actions or transformations on events, such as filtering, aggregation, enrichment, counts and joins, and then stores the results back to the event log. The results of stream processors may be considered “curated” or referred to as “enriched”. These results are utilized by Event Log consumers or additional stream processors, or are landed into storage or an analytic system.
A simple example of a stream processor is consuming order events from a log and filtering by the customer’s location to maintain a running count of orders for particular regions. This count by region may be consumed further downstream.
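The count by region example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `count_orders_by_region` function and the `region` field are invented here, and a plain list stands in for the event log. A real stream processor would also need to handle state storage and failure recovery, which this ignores.

```python
from collections import Counter


def count_orders_by_region(events, regions):
    """Yield the running order count per region after each matching event."""
    counts = Counter()
    for event in events:
        if event["region"] not in regions:
            continue  # filter: skip regions we are not tracking
        counts[event["region"]] += 1
        yield dict(counts)  # snapshot that could be written back to the log


order_events = [
    {"order_id": 1, "region": "EU"},
    {"order_id": 2, "region": "US"},
    {"order_id": 3, "region": "APAC"},
    {"order_id": 4, "region": "EU"},
]

results = list(count_orders_by_region(order_events, {"EU", "US"}))
print(results[-1])
```

Each yielded snapshot is a new “enriched” event that another consumer could pick up downstream.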
There are plenty of use cases to choose from, but I think the above use cases are a fine place to start for a Data Engineer. Do you? Let me know if we should include any others.
Let’s also consider when, why and how these data engineer use cases are utilized. I think the best approach to this consideration is through the following series of questions.
Why does the world need Streaming Data Engineers?
Companies, organizations and individuals need to process data from a variety of sources to make decisions. The quality of those decisions can be traced back to the inputs used in making them. Generally speaking, this traces back to the quality and quantity of the data used in the processing. The world needs data engineers to create the infrastructure to efficiently and effectively process data from a variety of sources. The number of questions and concerns data engineers face when creating, maintaining and evolving this data infrastructure is enormous. There are varying degrees of consequences for companies, organizations and/or individuals when the data engineer’s efforts and infrastructure are unsuccessful. To answer these questions and concerns, data engineers can lean on previously established design patterns or use cases, combined with software tools and processes.
What is a Data Engineer?
A Data Engineer is a behind-the-scenes workhorse who is often overlooked because their results are often not visual. They are the folks in the back and they often like it that way. They don’t win the awards or the trophies, but they support the folks that do. For example, they don’t necessarily create the fancy graphs you may see in your company’s report or newspaper of choice. They may have never heard of Edward Tufte. They haven’t designed your beautiful, easy-to-use, award-winning web application, but they are focused on the data that drives it.
That amazingly powerful, adaptable, life-changing machine learning model? Yeah, a data engineer likely didn’t create it.
The 4 gazillion dollars your company saved or made by making real-time logistics improvements, fraud-detection determinations, personalized production decisions? No, a data engineer likely didn’t create that result.
But again, they did provide the data infrastructure for the data scientist to create the model and the data infrastructure for a web designer/UX consultant to design the award-winning application, and they helped the data scientist train and release their models.
You can guarantee the data engineers assisted Joe, Arjun, and Akari when they respectively determined cost savings in logistic updates, improved fraud-detection indicators, and product recommendations in near real-time.
How do you become a Data Engineer?
You study known data infrastructure and data integration use cases, patterns, and solutions. You know a programming language or two. You know the off-the-shelf software tools (open source, proprietary, as-a-service) available for you to apply and the pros/cons of each. You are humble enough to know you never stop finding ways to become a better data engineer. There are many things you do not know, so don’t be a jerk to others and act as if you do.
Featured image credit https://pixabay.com/photos/board-electronics-computer-453758/