Apache Kafka
How would you explain Kafka to a non-technical stakeholder in your team?
Apache Kafka is a distributed event streaming platform that acts like a high-speed logistics hub for data. Producers send messages (events) to Kafka topics, which are durable, ordered logs. Consumers then read from those topics in real time.
Key analogy:
- Producers are like trucks delivering parcels (data)
- Topics are labeled shelves where parcels are temporarily stored in order
- Consumers are staff who pick up parcels and deliver them to their final destination
Kafka ensures smooth, reliable, rea
You have 5 million events coming in every minute. How would you handle topic design?
At ~83K events/second (5,000,000 / 60 ≈ 83,333), you need a topic design optimized for high throughput and parallel consumption.
Partition strategy:
- Use 50+ partitions so multiple consumers can read in parallel
- Choose a well-distributed partition key (e.g., hashed user ID or event ID) to avoid hot partitions
- Ensure the number of partitions is at least the number of consumers in your consumer group; consumers beyond the partition count sit idle
Throughput optimizations:
- Enable compression (Snappy or LZ4) to reduce network and disk I/O
- Tune batch.size (e.g., 64KB to 128KB) and
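The throughput figure and the partition-key advice above can be sanity-checked with a few lines of Python. This is a minimal sketch, not Kafka code: `partition_for` and `partition_skew` are hypothetical helpers, and SHA-256 is used only as a stand-in hash (Kafka's default partitioner uses murmur2 on the key bytes).

```python
import hashlib
from collections import Counter


def events_per_second(events_per_minute: int) -> float:
    """Convert a per-minute event rate into a per-second rate."""
    return events_per_minute / 60


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. Stand-in hash, NOT Kafka's murmur2."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_skew(keys: list, num_partitions: int) -> float:
    """Ratio of the busiest partition's load to a perfectly even load.

    1.0 means perfectly balanced; a value equal to num_partitions means
    every record landed on a single hot partition.
    """
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    ideal = len(keys) / num_partitions
    return max(counts.values()) / ideal


if __name__ == "__main__":
    print(f"{events_per_second(5_000_000):,.0f} events/second")
    spread_keys = [f"user-{i}" for i in range(100_000)]
    hot_keys = ["user-1"] * 100_000
    print(f"skew, many distinct keys: {partition_skew(spread_keys, 50):.2f}")
    print(f"skew, one hot key:        {partition_skew(hot_keys, 50):.2f}")
```

A skew close to 1.0 suggests the key spreads load evenly across the 50 partitions; a single dominant key pushes the skew toward the partition count, which is the hot-partition case the answer warns about.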
What's the impact of having a very high number of partitions?
While more partitions improve parallelism and throughput, they introduce significant overhead at scale.
Negative impacts:
- More open file handles: each partition has multiple log segments, index files, and time index files on each broker
- Slower leader elections: when a broker fails, the controller must elect new leaders for every partition that broker was leading; thousands of partitions mean longer unavailability windows
- Higher controller load: metadata management becomes expensive with many partitions
- Increased end t
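The file-handle point can be made concrete with rough arithmetic. The sketch below assumes three files per log segment (the log file, the offset index, and the time index); the partition and segment counts are illustrative inputs, not measurements, and `estimate_open_files` is a hypothetical helper.

```python
def estimate_open_files(partitions_on_broker: int,
                        segments_per_partition: int,
                        files_per_segment: int = 3) -> int:
    """Rough upper bound on segment-related files a broker may hold open.

    files_per_segment=3 assumes one .log, one .index and one .timeindex
    file per segment.
    """
    return partitions_on_broker * segments_per_partition * files_per_segment


if __name__ == "__main__":
    # Illustrative numbers only: 4,000 partition replicas, 10 segments each.
    print(estimate_open_files(4_000, 10))
```

Comparing the estimate against the broker process's open-file limit (for example, the value reported by `ulimit -n`) gives a quick indication of whether a planned partition count is realistic.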
What happens if a producer sends data to a topic that doesn't exist?
The behavior depends on the broker configuration auto.create.topics.enable.
If auto.create.topics.enable=true (default):
- Kafka automatically creates the topic with default settings (num.partitions, default.replication.factor)
- This is dangerous in production because the defaults may not match your SLA requirements (e.g., replication factor of 1 means no fault tolerance)
If auto.create.topics.enable=false:
- The producer receives an UNKNOWN_TOPIC_OR_PARTITION error
- The message is not written
Be
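The two branches above can be illustrated with a toy model. This is not a Kafka client or broker API: `ToyBroker` and `UnknownTopicOrPartition` are hypothetical names, and the model only mimics the decision described in the answer, using the broker defaults of 1 for both `num.partitions` and `default.replication.factor`.

```python
from dataclasses import dataclass, field


class UnknownTopicOrPartition(Exception):
    """Toy stand-in for the UNKNOWN_TOPIC_OR_PARTITION error."""


@dataclass
class ToyBroker:
    # Field names mirror the broker configs discussed above.
    auto_create_topics_enable: bool = True
    num_partitions: int = 1
    default_replication_factor: int = 1
    topics: dict = field(default_factory=dict)

    def produce(self, topic: str, message: str) -> None:
        if topic not in self.topics:
            if not self.auto_create_topics_enable:
                # Topic is missing and auto-creation is off: reject the write.
                raise UnknownTopicOrPartition(topic)
            # Auto-create with whatever the broker defaults happen to be.
            self.topics[topic] = {
                "partitions": self.num_partitions,
                "replication_factor": self.default_replication_factor,
                "messages": [],
            }
        self.topics[topic]["messages"].append(message)


if __name__ == "__main__":
    permissive = ToyBroker(auto_create_topics_enable=True)
    permissive.produce("orders", "order-1")
    print(permissive.topics["orders"]["replication_factor"])

    strict = ToyBroker(auto_create_topics_enable=False)
    try:
        strict.produce("orders", "order-1")
    except UnknownTopicOrPartition as err:
        print(f"rejected write to missing topic: {err}")
```

The permissive broker silently ends up with a topic that has a replication factor of 1, which is the production risk the answer describes.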
How do consumer groups handle rebalancing and what issues can it cause?
A rebalance is triggered when the consumer group membership changes (a consumer joins, leaves, or crashes) or when topic metadata changes (e.g., new partitions added). The group coordinator broker reassigns partitions across the group's current members.
How it works:
1. The group coordinator detects the membership change
2. It revokes partition assignments from all (or some) consumers
3. It reassigns partitions according to the partition assignor strategy
4. Consumers rejoin and start consuming from the
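To see why the assignor strategy matters, the steps above can be sketched with a simplified round-robin assignment. This is a toy model under stated assumptions, not Kafka's actual assignor code; `assign_round_robin` and `moved_partitions` are hypothetical helpers.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers in turn (simplified round-robin)."""
    members = sorted(consumers)
    if not members:
        raise ValueError("consumer group has no members")
    assignment = {member: [] for member in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


def moved_partitions(before, after):
    """Partitions whose owner differs between two assignments."""
    owner_before = {p: c for c, ps in before.items() for p in ps}
    owner_after = {p: c for c, ps in after.items() for p in ps}
    return sorted(p for p, c in owner_after.items() if owner_before.get(p) != c)


if __name__ == "__main__":
    partitions = range(6)
    before = assign_round_robin(partitions, ["c1", "c2", "c3"])
    after = assign_round_robin(partitions, ["c1", "c2"])  # c3 left the group
    print(before)
    print(after)
    print("moved:", moved_partitions(before, after))
```

In this example only the two partitions owned by the departed consumer strictly needed a new owner, yet recomputing the assignment from scratch moves four of the six. Limiting that kind of unnecessary movement is the motivation for sticky and cooperative assignment strategies.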
Scenario: One consumer in a group is significantly slower. What happens?
Kafka does not automatically redistribute partitions based on consumer speed. The slow consumer will fall behind, building up consumer lag on its assigned partitions, while other consumers finish quickly and sit idle.
What can go wrong:
- If the slow consumer exceeds max.poll.interval.ms without calling poll(), Kafka considers it dead and triggers a rebalance
- This can cause a cascading cycle of rebalances if the consumer repeatedly times out
Solutions:
- Scale out: add more consumer instances so eac
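Both failure signals described above reduce to simple arithmetic. The sketch below is illustrative only: `consumer_lag` and `exceeds_poll_interval` are hypothetical helpers, the offsets are made-up inputs, and the 300,000 ms value assumes the consumer's default `max.poll.interval.ms`. The `records_per_poll` parameter corresponds to the batch size returned by one `poll()` call, which `max.poll.records` caps.

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def exceeds_poll_interval(processing_ms_per_record: float,
                          records_per_poll: int,
                          max_poll_interval_ms: int = 300_000) -> bool:
    """True if processing one poll() batch would outlast max.poll.interval.ms."""
    return processing_ms_per_record * records_per_poll > max_poll_interval_ms


if __name__ == "__main__":
    # Partition 1 belongs to the slow consumer in this made-up example.
    print(consumer_lag({0: 1_000, 1: 1_000}, {0: 990, 1: 400}))
    print(exceeds_poll_interval(processing_ms_per_record=700, records_per_poll=500))
```

When the second check comes out true, either the batch has to shrink or the per-record work has to get faster, otherwise the consumer risks being evicted from the group on every batch.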
Can Kafka guarantee exactly-once delivery? If yes, how?
Yes, Kafka supports exactly-once semantics (EOS) through a combination of three mechanisms, but it requires careful end-to-end design.
1. Idempotent Producer:
- Enabled with enable.idempotence=true
- Kafka assigns each producer a Producer ID (PID) and tracks sequence numbers per partition
- Duplicate messages (from retries) are detected and discarded at the broker
- Guarantees exactly-once within a single partition from a single producer session
2. Transactions:
- The producer wraps reads + writes in an a
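The idempotent-producer mechanism (a PID plus per-partition sequence numbers) can be sketched as a toy partition log. This is a deliberately simplified model, not the broker's implementation: `ToyPartitionLog` is a hypothetical class that tracks only the last accepted sequence number per producer.

```python
class ToyPartitionLog:
    """Toy model of one partition that deduplicates by (producer ID, sequence)."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Already written: a retry of an earlier send, so discard it.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier record was lost or arrived out of order.
            raise ValueError(f"out-of-order sequence {seq}, expected {last + 1}")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


if __name__ == "__main__":
    log = ToyPartitionLog()
    print(log.append(producer_id=7, seq=0, value="a"))
    print(log.append(producer_id=7, seq=1, value="b"))
    print(log.append(producer_id=7, seq=1, value="b"))  # retry after a lost ack
    print(log.records)
```

Because the state is keyed by producer ID, a producer that restarts with a new PID starts a fresh sequence, which is why this guarantee on its own covers only a single producer session.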
What is ISR (In-Sync Replica) and why is it critical?
The In-Sync Replica (ISR) set is the group of replicas (including the leader) that are fully caught up with the leader's log. Only replicas in the ISR are eligible to become the new leader if the current leader fails.
How it works:
- Each partition has a leader and zero or more follower replicas
- Followers continuously fetch from the leader; if a follower falls behind by more than replica.lag.time.max.ms (default 30s), it is removed from the ISR
- When it catches up, it is added back
Why ISR is criti
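The ISR membership rule above can be expressed as a small function. This is a simplified sketch, not broker code: `in_sync_replicas` is a hypothetical helper, and the timestamps are illustrative values for when each follower was last fully caught up with the leader.

```python
def in_sync_replicas(leader: int,
                     last_caught_up_ms: dict,
                     now_ms: int,
                     replica_lag_time_max_ms: int = 30_000) -> set:
    """Return the ISR: the leader plus every follower that was caught up
    within the last replica_lag_time_max_ms milliseconds."""
    isr = {leader}
    for replica, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= replica_lag_time_max_ms:
            isr.add(replica)
    return isr


if __name__ == "__main__":
    # Broker 1 leads; follower 2 caught up 5s ago, follower 3 caught up 40s ago.
    followers = {2: 95_000, 3: 60_000}
    print(in_sync_replicas(leader=1, last_caught_up_ms=followers, now_ms=100_000))
```

In this example follower 3 has been behind for longer than the threshold, so it drops out of the ISR and would not be eligible to take over as leader until it catches up again.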