Apache Kafka
WhatsThat?
Apache Kafka is a message broker.
Its main purpose in life is to receive messages from publishers and make them available to subscribers.
WhyIsItCool?
Kafka is cool because it is one of the most reliable and performant message brokers available today.
It is open source, originally developed at LinkedIn, where it is still actively used.
It has a large adopter base and a very active community, as well as top-notch support providers.
Beyond publish/subscribe messaging, it also makes it easy to transform, enrich, filter and process any messages passing through it.
HowDoesItWork?
Kafka is based on the concept of a Distributed Commit Log.
It stores messages in transit in an append-only log (a file). This is very similar to the way relational database engines store their transaction logs. Every message is simply added to the end of the file, which makes writes super fast.
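To make that concrete, here is a tiny shell sketch of the append-only idea (purely an illustration; this is not Kafka's actual on-disk format):

# Every publish is conceptually just an append to the end of a file.
echo '{"purchase": 1}' >> purchases-0.log
echo '{"purchase": 2}' >> purchases-0.log
# No seeks, no rewrites: purely sequential writes are why this is fast.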
Broker and Cluster
Any machine or server running Kafka software is referred to as a Broker. If more than one Broker is running, those Brokers will by default form a Cluster. All of the Brokers in a Cluster share the load for the messages they need to process. They also replicate messages between them to safeguard those messages, should one of the Brokers fail.
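To sketch how a Cluster forms (the config file names and values here are illustrative): every Broker gets its own properties file with a unique broker.id, and Brokers that point at the same ZooKeeper ensemble discover each other and form a Cluster.

# Start three Brokers; each server-N.properties needs a unique broker.id
# (and, when run on one host, its own port and log.dirs), but the same
# zookeeper.connect=localhost:2181.
bin/kafka-server-start.sh config/server-1.properties
bin/kafka-server-start.sh config/server-2.properties
bin/kafka-server-start.sh config/server-3.properties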
Offset
As a message is written to the log, it is assigned an Offset. An Offset is a unique, sequential number that identifies that specific message in that specific log.
But logs can grow very large, so Kafka enables scalability by distributing the log.
There are effectively many of these individual logs, each assigning Offsets to the messages it handles.
The way in which Kafka groups or routes messages into these distributed logs is controlled by two concepts, namely Topics and Partitions.
Topics
A Topic is a definition or grouping of specific messages. It gives a name by which we can address a message stream. It also provides the ability to define a structure of the messages that will form part of this Topic.
As an example, we can define a Topic called PURCHASES, to which we will publish all purchase messages:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic PURCHASES
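Once created, we can inspect the Topic with the same tool (same ZooKeeper address as above):

kafka-topics.sh --describe --zookeeper localhost:2181 --topic PURCHASES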
Partitions
When we create a Topic, we can specify how many Partitions that Topic should have. A Partition refers to a specific log file. Partitions for the same Topic normally reside on different Brokers.
If we create a Topic with more than one Partition, messages flowing through that Topic will be spread across the different Partitions.
Assigning different Partitions to different Brokers for a Topic means that we are adding the ability to scale for that Topic as the workload is shared by different Brokers.
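As an example, we could grow the PURCHASES Topic from 1 Partition to 3 (Kafka only allows increasing the Partition count, never decreasing it):

kafka-topics.sh --alter --zookeeper localhost:2181 --partitions 3 --topic PURCHASES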
Replication
Each copy of a Partition is referred to as a Replica. We can specify how many Replicas of every Partition we want to create for a Topic. This number is also called the Replication Factor.
Only a single Replica per Partition is active at a given point in time. That Replica is referred to as the Leader Replica. All other Replicas are referred to as Follower Replicas.
Different Replicas for a Partition also reside on different Brokers. This provides Replication and ensures that a single Broker failure will not result in a Partition loss (given there is more than one available, healthy Broker).
The Broker hosting the Leader Replica handles all requests for that Partition. Publishers publish messages directly to that Broker, and all Subscribe requests are also handled directly by that Broker. The Follower Replicas simply subscribe to the Leader Replica and try to keep themselves in sync with it.
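The describe command from earlier makes all of this visible. Against a Topic with a Replication Factor of 3, its output looks roughly like this (the Broker ids are illustrative):

Topic: PURCHASES  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3

Leader is the Broker hosting the Leader Replica, Replicas lists every Broker holding a copy, and Isr (in-sync replicas) lists the Replicas, Leader included, that are currently in sync.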
Offset as a unique identifier
An Offset uniquely identifies a message only in combination with its Topic and Partition:
Topic-Partition-Offset
Example:
Purchases-1-456332
Purchases-2-456332
Purchases-3-4344
Quotes-1-456332
Retention
Kafka keeps the log files (Partitions) for a specified duration, or until the log reaches a certain size.
The default for this duration is set to 7 days.
After the Retention period expires, that log Segment is deleted.
(A log Segment is the most granular unit of a log: a log file is actually a collection of smaller files referred to as Segments.)
Messages are available to Subscribers as long as they are in the log files, that is, as long as the retention period or size constraints have not been reached.
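Retention is configurable per Topic. As a sketch (the values are illustrative), this lowers the retention of PURCHASES to 3 days by time and 1 GB by size:

kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name PURCHASES --add-config retention.ms=259200000,retention.bytes=1073741824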
Message Consume and Replay
Because messages are persisted to disk, Subscribers can ask for those messages at any given time. They do this by specifying an Offset to start reading messages from.
The Broker then finds that Offset in the log files and starts sending those messages to the Subscriber sequentially, in order.
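Taking the Purchases-1-456332 example from earlier, a console Subscriber could start reading at exactly that coordinate (assuming a Broker listening on localhost:9092):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Purchases --partition 1 --offset 456332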
CaseStudy
We have a fictional company, BuyAnAppleDonut, which (strangely enough) sells ApricotDonuts.
We are only interested in purchases, so we will create a Topic called Purchases.
Our donuts sell like crazy and we have a million sales per day. We fear this will be too much for a single Broker to handle.
We make roughly equal sales in Canada, the USA and South Africa.
We thus partition the Purchases Topic by Country, giving us three Partitions (we sketch how messages get routed by Country a little further down).
Let's say we have 3 Kafka Brokers. Let's also say we have decided that we want at least two additional copies of our Partitions (a Replication Factor of 3), so that two servers can fail and we can still sell donuts.
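With those numbers decided, creating the Topic could look like this (ZooKeeper address as before):

kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic Purchases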
Kafka will distribute the Partitions so that the Leader Replicas reside on different Brokers for the different Partitions.
Thus, our 3 Kafka Brokers will all have a copy of all 3 Partitions (South Africa, USA and Canada). Broker 1 will serve only requests for South Africa, and be a backup for USA and Canada. Broker 2 will serve only requests for USA, and be a backup for South Africa and Canada.
Can you guess who will serve Canada's messages?
It depends. If all is well, Broker 3 will be serving Canada, but if anything happened to Broker 3, either Broker 1 or 2 would be automatically elected the new Leader Replica for Canada and start serving requests for Canada.
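As for routing by Country: one way is to publish every purchase with its country code as the message key. Kafka's default partitioner hashes the key, so all messages for one country always land in the same Partition. (With only three keys and three Partitions, hashing will usually, but not necessarily, give every country its own Partition; a custom partitioner would guarantee it.) A sketch using the console producer:

kafka-console-producer.sh --broker-list localhost:9092 --topic Purchases --property parse.key=true --property key.separator=:
# Then type keyed messages, for example:
# ZA:{"item": "ApricotDonut", "qty": 12}
# US:{"item": "ApricotDonut", "qty": 3}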
One amazing aspect of Kafka, found in almost no other message broker out there, is easy message replay. So easy, in fact, that the Consumer controls it. If we would like to receive data from the past, we can simply set the Partition and Offset to consume from in our Consumer client, and we will get all of our messages, as long as they have not expired.
As long as we keep track of the Partition and Offset combinations for the Topics we are consuming as a subscriber, we can get any message, all messages, or a subset of messages, over and over again.
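For instance, replaying everything still retained in the Topic takes a single flag (again assuming a Broker on localhost:9092):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic Purchases --from-beginning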
ThatsAllFolks
Sticking to the WhatsThat idea, we need to cap this short intro.
Here are a couple of links for further reading: