Apache Kafka vs. Amazon Kinesis
The question of Kafka vs Kinesis often comes up. Let’s start with Kinesis.
*** Updated Spring 2020 ***
Since the original post, AWS has released MSK (Amazon Managed Streaming for Apache Kafka). I think this tells us everything we need to know about Kafka vs Kinesis. Also, since the original post, Kinesis has been separated into multiple “services” such as Kinesis Video Streams, Data Streams, Data Firehose, and Data Analytics. I’ll make updates to the content below, but let me know if you have any questions or concerns.
Like many of the offerings from Amazon Web Services, Amazon Kinesis is modeled after an existing open source system. In this case, Kinesis appears to be modeled after a combination of pub/sub solutions like RabbitMQ and ActiveMQ, with regard to its maximum retention period of 7 days, and Kafka in other ways, such as sharding.
Kinesis is known to be reliable and easy to operate. If you don’t need large scale, strict ordering, hybrid cloud architectures, or exactly-once semantics, it can be a perfectly fine choice. The same goes if you don’t need the pre-built connectors that Kafka Connect offers, or stream processing with Kafka Streams / KSQL.
Amazon Kinesis has built-in cross-replication, while Kafka requires you to configure it yourself, for example with a tool such as MirrorMaker. Cross-replication is the idea of syncing data across logical or physical data centers. Cross-replication is not mandatory, and you should set it up only if you need it.
Engineers sold on the value proposition of Kafka and Software-as-a-Service, or perhaps more specifically Platform-as-a-Service, have options besides Kinesis or Amazon Web Services. Keep an eye on https://confluent.io.
When to use Kafka or Kinesis?
Both Kafka and Kinesis are often utilized as an integration system in enterprise environments similar to traditional message pub/sub systems. Integration between systems is assisted by Kafka clients in a variety of languages including Java, Scala, Ruby, Python, Go, Rust, Node.js, etc.
Common use cases include website activity tracking for real-time monitoring, recommendations, etc. or loading into Hadoop or analytic data warehousing systems from a variety of data sources for possible batch processing and reporting.
Both options have the construct of Consumers and Producers.
The two use different names for the data flowing through them: Kinesis calls each unit a “Data Record”, whereas Kafka refers to it as an Event or a Message interchangeably.
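To line up the terminology, here is a rough sketch of the two record shapes as plain Python dataclasses. The field selection is simplified (real records carry more, such as timestamps), and the class names and the `to_kafka_message` helper are hypothetical, made up for illustration rather than taken from either client library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KinesisDataRecord:
    partition_key: str    # decides which shard receives the record
    data: bytes           # opaque payload
    sequence_number: str  # assigned by Kinesis within the shard


@dataclass(frozen=True)
class KafkaMessage:
    topic: str
    partition: int
    offset: int           # position of the message within its partition
    key: Optional[bytes]  # optional; commonly used to choose the partition
    value: bytes          # opaque payload


def to_kafka_message(record, topic, partition, offset):
    """Hypothetical helper: carry a Kinesis-style record over to a Kafka-style message."""
    return KafkaMessage(
        topic=topic,
        partition=partition,
        offset=offset,
        key=record.partition_key.encode("utf-8"),
        value=record.data,
    )


record = KinesisDataRecord(partition_key="user-7", data=b"clicked", sequence_number="1")
message = to_kafka_message(record, topic="clicks", partition=0, offset=0)
```

Roughly speaking, the Kinesis partition key plays the role of the Kafka message key, and the sequence number plays the role of the offset.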
Key technical components in the comparison include ordering, retention period (e.g. whether you need more than 7 days), scale, stream processing implementation options, pre-built connectors or frameworks for building custom integrations, exactly-once semantics, and transactions.
An interesting aspect of Kafka and Kinesis lately is the use of stream processing. More and more applications and enterprises are building architectures which include processing pipelines consisting of multiple stages. For example, a multi-stage design might include raw input data consumed from Kafka topics in stage 1. In stage 2, data is consumed and then aggregated, enriched, or otherwise transformed. Then, in stage 3, the data is published to new topics for further consumption or follow-up processing during a later stage.
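The three stages can be sketched in a few lines of plain Python. This is only an in-memory illustration: the `topics` dictionary stands in for a broker, and the topic names, the `publish` and `consume` helpers, and the click data are all made up for the example. A real pipeline would use a Kafka or Kinesis client plus a stream processing framework.

```python
from collections import defaultdict

# In-memory stand-in for a broker: topic name -> list of messages.
topics = defaultdict(list)


def publish(topic, message):
    topics[topic].append(message)


def consume(topic):
    """Return every message currently in a topic (no offsets in this sketch)."""
    return list(topics[topic])


# Stage 1: raw input data lands in a topic.
for raw in ["home,alice", "cart,bob", "home,bob", "cart,alice", "home,carol"]:
    publish("raw-clicks", raw)


# Stage 2: consume the raw data, parse it, and aggregate views per page.
def aggregate_page_views(raw_messages):
    counts = defaultdict(int)
    for line in raw_messages:
        page, _user = line.split(",")
        counts[page] += 1
    return dict(counts)


page_views = aggregate_page_views(consume("raw-clicks"))

# Stage 3: publish the results to a new topic for later stages.
for page, views in sorted(page_views.items()):
    publish("page-view-counts", {"page": page, "views": views})
```

Each stage only knows about the topic it reads from and the topic it writes to, which is what lets later stages be added without touching earlier ones.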
Kafka and Kinesis Resources
Spark Streaming with Kafka example
Spark Streaming with Kinesis example
Kafka and Kinesis Scale
Both attempt to address scale through the use of “sharding”. In Kinesis, this is called a shard while Kafka calls it a partition. Kafka guarantees the order of messages within a partition. Kinesis also assigns each record a sequence number within its shard, but the ordering you get depends on how you write: the batch PutRecords call does not guarantee record order, so strict ordering takes more care. The canonical example of the importance of ordering is bank or inventory scenarios. The ordering of credits and debits matters. The ordering of a product shipping event compared to available product inventory matters.
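To make the partition idea concrete, here is a minimal sketch of key-based partitioning using the bank example. It is an assumption-laden illustration, not either system's implementation: the MD5-modulo hash is a stand-in (Kafka's default partitioner uses a different hash), and `NUM_PARTITIONS`, `partition_for`, `append_all`, and `history` are hypothetical names. The point is that events sharing a key land in the same partition and keep their relative order there, while no order is defined across partitions.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition (shard) count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (a stand-in, not Kafka's own hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def append_all(events):
    """Append (key, value) events to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in events:
        logs[partition_for(key)].append((key, value))
    return logs


events = [
    ("account-1", "credit 100"),
    ("account-2", "credit 50"),
    ("account-1", "debit 30"),
    ("account-2", "debit 20"),
    ("account-1", "debit 70"),
]
logs = append_all(events)


def history(key):
    """Read back one key's events from the partition that owns the key."""
    return [value for k, value in logs[partition_for(key)] if k == key]
```

Calling `history("account-1")` returns that account's credit and debits in the order they were written, which is why the choice of key matters when ordering is a requirement.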
Kinesis Kafka Ecosystem Comparisons
A few of the Kafka ecosystem components were mentioned above, such as Kafka Connect and Kafka Streams. Let’s consider those for a moment.
Kafka Connect has a rich ecosystem of pre-built Kafka Connectors. I believe the closest Kinesis equivalent of this kind of pre-built integration is Kinesis Data Firehose. Example: you’d like to land messages from Kafka or Kinesis into Elasticsearch. How would you do that? Yes, of course, you could write custom Consumer code, but you could also use an off-the-shelf solution.
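For a sense of what the custom route involves, here is a sketch of one small piece of it: turning a batch of consumed records into a request body for the Elasticsearch bulk API, which takes newline-delimited JSON with an action line followed by a document line. The record layout, the `to_bulk_body` name, and the ids built from topic, partition, and offset are assumptions for illustration. Consuming from the stream, sending the HTTP request, retries, and error handling are all left out, and those are the parts an off-the-shelf connector handles for you.

```python
import json


def to_bulk_body(records, index_name):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON)."""
    lines = []
    for record in records:
        # Action line: index the document under a deterministic _id so that
        # replaying the same record overwrites it instead of duplicating it.
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(record["doc"]))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical consumed records; the ids imitate a topic-partition-offset scheme.
records = [
    {"id": "orders-0-41", "doc": {"order": 41, "status": "shipped"}},
    {"id": "orders-0-42", "doc": {"order": 42, "status": "pending"}},
]
body = to_bulk_body(records, "orders")
```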
As briefly mentioned above, stream processing between the two options appears to be quite different. I’m not sure if there is an equivalent of Kafka Streams / KSQL for Kinesis. Please let me know. I mean, we could write our own or use Spark, but is there a direct comparison to Kafka Streams / KSQL in Kinesis? Kinesis Data Analytics, mentioned in the update above, or AWS Glue maybe?
A final consideration, for now, is Kafka Schema Registry. Kinesis does not seem to have this capability yet, but AWS EventBridge Schema Registry appears to be coming soon at the time of this writing.
Hope this helps. Let me know if I missed anything or if you’d like more detail in a particular area.
Featured image credit https://flic.kr/p/7XWaia