Notes on Kafka, Part 1

Mangat Rai Modi
Oct 23, 2017

Personal notes! I have worked with Kafka in the past, but I needed to catch up after a year of inactivity. These are notes made while reading the articles in the references.

This part covers the basics only. I am planning to write multiple posts covering advanced topics, existing designs, and best practices.

  1. One phrase to define Kafka: a publish-subscribe distributed streaming pipeline with persistence and a replay option (retention time is configurable). If it helps, visualise it as the tail -f command in Linux, but with more features.
  2. Data is always in byte arrays, so it can carry virtually any data format which can be encoded to a byte array and decoded back (see the sketch after this list).
  3. Kafka can have any type of producers and consumers, as long as they implement Kafka's wire protocol.
  4. Consumers take the responsibility of maintaining the offset, i.e. the position in a partition up to which they have read.
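
To make points 2 and 3 concrete, here is a minimal producer sketch in Java; it assumes a local broker on localhost:9092 and a topic named user-login, both of which are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ByteArrayProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            // Keys and values travel as raw bytes (point 2 above).
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (Producer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                byte[] key = "user-42".getBytes();
                byte[] value = "login".getBytes();
                // Push to a topic; the broker persists it for the configured retention period.
                producer.send(new ProducerRecord<>("user-login", key, value));
            }
        }
    }

Any client that speaks the same wire protocol could produce the same bytes; the Java client is just the most common choice.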

Kafka Architecture

Image 1 [2]

Topic

The highest-level container for a message. A topic should have a logical relationship with its messages; conversely, all messages belonging to the same logical group should be sent to one topic. For example, messages for user login actions should go to a topic called “user-login”, and this topic shouldn’t contain anything else, such as user purchase actions. A topic is further divided into one or more partitions.
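
As an illustration (not from the original notes), such a topic could be created programmatically with the Java AdminClient; the topic name, partition count, and replication factor below are placeholder values:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // "user-login" with 3 partitions, each copied to 2 brokers.
                NewTopic topic = new NewTopic("user-login", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }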

Partition

The next level of the hierarchy. Partitioning provides parallelism, redundancy (read: fault tolerance) and ordering. Essentially, a consumer consumes the messages in a partition in FCFS order. In other words, Kafka ensures that messages are ordered inside a partition (not across the topic, as suggested in [1]). So, if you need message A and message B to be ordered, you need to assign them the same key, and hence the same partition. Due to this FCFS nature, a partition is also referred to as a log in some literature.
Each partition in Kafka can be stored on one or many servers. A message in one partition is completely independent of the messages in other partitions, which gives the architecture several benefits. The messages can be read and processed independently by consumers, so the amount of parallelism achievable is equal to the number of partitions. Also, since the partitions are independent, they can be stored on the same or different servers. Kafka provides a setting called the replication factor, say n: each partition will be copied to n servers, so the data is not lost as long as at most n-1 of those servers fail.
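
To make the ordering point concrete, here is a small sketch showing that two records with the same key land in the same partition. The topic, key, and broker address are placeholders; the behaviour shown is that of the Java client's default partitioner:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SameKeySamePartitionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // A and B share the key "user-42", so the default partitioner hashes
                // them to the same partition and their relative order is preserved.
                RecordMetadata a = producer.send(new ProducerRecord<>("user-login", "user-42", "A")).get();
                RecordMetadata b = producer.send(new ProducerRecord<>("user-login", "user-42", "B")).get();
                System.out.println(a.partition() == b.partition()); // prints true
            }
        }
    }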

Message/Record

A message (or record) is the smallest and most fundamental unit of data. It comprises the following:

  • Key: decides which partition the message goes to. A random (or absent) key gives roughly uniform load balancing across partitions.
  • Value: the actual message, as a byte array.
  • Timestamp: the creation time of the record.
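
For illustration, the Java client exposes these fields directly on ProducerRecord; the values below are placeholders:

    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RecordFieldsSketch {
        public static void main(String[] args) {
            String topic = "user-login";                  // placeholder topic
            Integer partition = null;                     // null: let the partitioner pick, based on the key
            Long timestamp = System.currentTimeMillis();  // creation time of the record
            String key = "user-42";                       // drives partition assignment
            String value = "login";                       // the payload, serialized to bytes on send

            ProducerRecord<String, String> record =
                    new ProducerRecord<>(topic, partition, timestamp, key, value);
            System.out.println(record);
        }
    }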

Actor — Producer

  1. Produces messages by encoding data into byte arrays.
  2. Publishes (pushes) them to one or more topics.
  3. Inside a topic, the producer decides which partition to write to. It could be round-robin, or based on some key (see the sketch after this list).
  4. Producers are separate processes, which can run on the same or different machines.
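
As an illustration of point 3, the partition choice can be customized by implementing the Java client's Partitioner interface. The class below is a hypothetical sketch that hashes the key onto a partition; it would be registered through the partitioner.class producer property:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    // Hypothetical key-hash partitioner: the same key always maps to the same partition.
    public class KeyHashPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0; // no key: fall back to a fixed partition for simplicity
            }
            // Hash the key bytes onto one of the topic's partitions.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }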

Actor — Consumer

Image 2 [2]
  1. Consumers are grouped together into an entity called a consumer group.
  2. A consumer group typically listens to one topic, although Kafka also allows a group to subscribe to multiple topics; many consumer groups can consume the same topic independently.
  3. Inside a consumer group, a consumer can read from one or more partitions, but each partition is read by only one consumer in the group (an n:1 relationship between partitions and consumers).
  4. Consumers pull the messages, hence they set the pace of consumption.
  5. Effectively, there is a 1:1 relationship between a message and a consumer within a consumer group: each message is processed by exactly one consumer in the group.
  6. Each consumer is a separate process, and consumers can live on the same or different servers.
  7. Each consumer maintains an offset for each of its partitions (see the sketch after this list).
  8. From the above it follows that there cannot be more active consumer instances in a single consumer group than there are partitions; extra consumers sit idle. This restriction preserves ordering inside a partition and prevents duplicate consumption.
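
A minimal consumer-group sketch in Java, assuming a recent Java client, the same placeholder broker and topic as above, and a placeholder group name login-audit. The group.id property determines which consumer group the process joins:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerGroupSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker
            props.put("group.id", "login-audit");              // placeholder consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("enable.auto.commit", "false");          // commit offsets explicitly below

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-login"));
                while (true) {
                    // Pull model: the consumer asks the broker for new records at its own pace.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                    // Point 7: the consumer records how far it has read in each partition.
                    consumer.commitSync();
                }
            }
        }
    }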

Actor — Brokers

Image 3 [3]

Kafka is a persistent system. That means the data sent to Kafka partitions is actually stored in secondary storage. Each server in Kafka which holds data is known as a broker. See image 3 [3].

  1. A broker can be considered the main process, which listens to producers pushing messages and to consumers pulling them.
  2. Number of brokers ≥ replication factor.
  3. Persistence can be configured in Kafka (see the configuration sketch after this list). More days of retention mean more data and a bigger disk; however, Kafka's speed remains constant regardless of the retention period.
  4. For each partition there is exactly one leader and zero or more followers: the leader receives messages from producers, and the follower brokers pull (replicate) the data from the leader.
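
A few of the broker settings mentioned above live in server.properties; the values below are example placeholders, not recommendations:

    # Example broker settings (server.properties); values are placeholders.

    # Unique id for this broker in the cluster.
    broker.id=1

    # Where partition data is persisted on disk.
    log.dirs=/var/lib/kafka/data

    # Retention: keep data for 7 days before deleting it.
    log.retention.hours=168

    # Defaults for newly created topics: partition count and replication factor.
    num.partitions=3
    default.replication.factor=2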

Future Reading

The next post will focus on how Kafka ensures fault tolerance, ordered delivery, consistency, and efficiency. Further posts will include some real-life case studies.

Disclaimer: All the text here reflects my personal views and is not related to my employer or anyone else. I have written down my understanding and notes for myself, and I am not responsible for any decision motivated by this text. I have provided references and citations for the articles I read and the images I used. If you believe you have rights to any of the content, contact me and I will remove it.
