Question 1

How would you explain Kafka to a non-technical stakeholder in your team?

Accepted Answer

Apache Kafka is a distributed event streaming platform that acts like a high-speed logistics hub for data. Producers send messages (events) to Kafka topics, which are durable, ordered logs. Consumers then read from those topics in real time.

Key analogy:

- Producers are like trucks delivering parcels (data).
- Topics are labeled shelves where parcels are temporarily stored in order.
- Consumers are staff who pick up parcels and deliver them to their final destination.

Kafka ensures a smooth, reliable, real-time flow of data from any source to any destination. It decouples producers from consumers so each side can scale independently.
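For a slightly more technical audience, the shelf analogy can be sketched in a few lines. This is a toy in-memory model, not Kafka itself: `ToyTopic` and `ToyConsumer` are hypothetical names, and the sketch assumes a single partition with no persistence or networking.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic: an append-only list."""

    def __init__(self):
        self._log = []

    def append(self, event):
        """Producer side: store an event and return its offset (position)."""
        self._log.append(event)
        return len(self._log) - 1

    def read_from(self, offset):
        """Consumer side: return every event at or after the given offset."""
        return self._log[offset:]


class ToyConsumer:
    """Tracks its own position, so consumers do not interfere with each other."""

    def __init__(self, topic):
        self._topic = topic
        self.offset = 0

    def poll(self):
        events = self._topic.read_from(self.offset)
        self.offset += len(events)
        return events


topic = ToyTopic()
billing = ToyConsumer(topic)
analytics = ToyConsumer(topic)

topic.append("order-1")
topic.append("order-2")
first_billing = billing.poll()    # billing reads both events
topic.append("order-3")
second_billing = billing.poll()   # billing sees only the new event
all_analytics = analytics.poll()  # analytics reads later, at its own pace
```

The point of the sketch is the decoupling: the producer never waits for a consumer, and each consumer keeps its own position in the same ordered log.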

Question 2

You have 5 million events coming in every minute. How would you handle topic design?

Accepted Answer

At roughly 83,000 events/second (5,000,000 / 60), you need a topic design optimized for high throughput and parallel consumption.

Partition strategy:

- Use 50+ partitions as a starting point so multiple consumers can read in parallel.
- Choose a well-distributed partition key (e.g., hashed user ID or event ID) to avoid hot partitions.
- Ensure the number of partitions is at least the number of consumers in your consumer group; consumers beyond the partition count sit idle.

Throughput optimizations:

- Enable compression (Snappy or LZ4) to reduce network and disk I/O.
- Tune batch.size (e.g., 64 KB to 128 KB) and linger.ms (e.g., 5 to 20 ms) on producers for batching.
- Set acks=1 if you can tolerate rare message loss, or acks=all for durability.

Monitoring:

- Track consumer lag via tools like Burrow, Prometheus, or Confluent Control Center.
- Alert on lag growth to detect throughput bottlenecks early.
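The arithmetic and the effect of key choice can be checked with a short sketch. It assumes 50 partitions and uses CRC32 purely as an illustrative stand-in hash (Kafka's default partitioner uses murmur2); the key names are made up.

```python
import zlib
from collections import Counter

EVENTS_PER_MINUTE = 5_000_000
NUM_PARTITIONS = 50  # assumed starting point, not a recommendation

events_per_second = EVENTS_PER_MINUTE // 60
per_partition_rate = events_per_second / NUM_PARTITIONS


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Many distinct keys (e.g., user IDs) spread load across partitions.
spread = Counter(partition_for(f"user-{i}") for i in range(100_000))

# A single dominant key sends every event to one partition: a hot partition.
hot = Counter(partition_for("tenant-1") for _ in range(1_000))

print(f"{events_per_second} events/s, ~{per_partition_rate:.0f} per partition")
print(f"busiest/quietest partition: {max(spread.values())}/{min(spread.values())}")
print(f"partitions used by the hot key: {len(hot)}")
```

Comparing the busiest and quietest partition counts on a sample of real keys is a cheap way to spot skew before committing to a key.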

Question 3

What's the impact of having a very high number of partitions?

Accepted Answer

While more partitions improve parallelism and throughput, they introduce significant overhead at scale.

Negative impacts:

- More open file handles: each partition has multiple log segments, index files, and time index files on each broker.
- Slower leader elections: when a broker fails, the controller must elect new leaders for all of that broker's partitions, so thousands of partitions mean longer unavailability windows.
- Higher controller load: metadata management becomes expensive with many partitions.
- Increased end-to-end latency: a message is committed only after it has been replicated to all in-sync replicas, and each broker has more partitions to replicate, so producers using acks=all may wait longer for acknowledgment.
- Longer broker restart times: log recovery for thousands of partitions is slow.

Rule of thumb:

- Start with partitions = max(expected throughput / throughput per partition, number of consumers).
- Commonly cited guidance for ZooKeeper-based clusters is to keep partitions per broker under ~4,000 and total cluster partitions under ~200,000 (KRaft mode supports more).
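The rule of thumb is simple enough to write down as a function. This is a minimal sketch: `suggested_partitions` is a hypothetical helper, and the throughput figures in the examples are invented, so measure per-partition throughput on your own cluster before relying on the result.

```python
import math


def suggested_partitions(target_mb_s: float,
                         per_partition_mb_s: float,
                         num_consumers: int) -> int:
    """Return max(target throughput / per-partition throughput, consumers)."""
    if per_partition_mb_s <= 0:
        raise ValueError("per_partition_mb_s must be positive")
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return max(by_throughput, num_consumers)


# Throughput-bound: 100 MB/s target at 10 MB/s per partition, 6 consumers.
throughput_bound = suggested_partitions(100, 10, 6)

# Consumer-bound: same throughput, but 24 consumers need 24 partitions.
consumer_bound = suggested_partitions(100, 10, 24)
```

Treat the result as a starting value; the overheads listed above are the reason not to pad it by a large factor.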

Question 4

What happens if a producer sends data to a topic that doesn't exist?

Accepted Answer

The behavior depends on the broker configuration auto.create.topics.enable.

If auto.create.topics.enable=true (default):

- Kafka automatically creates the topic with the broker defaults (num.partitions, default.replication.factor).
- This is dangerous in production because the defaults may not match your SLA requirements (e.g., a replication factor of 1 means no fault tolerance).

If auto.create.topics.enable=false:

- The metadata request for the topic fails with an UNKNOWN_TOPIC_OR_PARTITION error. In the Java client this typically surfaces to the application as a timeout once max.block.ms elapses.
- The message is not written.

Best practice:

- Disable auto-creation in production.
- Define topics explicitly using infrastructure as code (Terraform, Ansible, or CLI scripts).
- This ensures correct partition counts, replication factors, and retention policies.
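As one way to make topic creation explicit, a script can assemble the `kafka-topics.sh` invocation from reviewed settings. The sketch below only builds the argument list and does not run it; the topic name, broker address, and config values are placeholders, and `create_topic_command` is a hypothetical helper.

```python
import shlex


def create_topic_command(topic, partitions, replication_factor,
                         bootstrap_server, configs=None):
    """Build (but do not run) a kafka-topics.sh --create argument list."""
    args = [
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--create", "--if-not-exists",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]
    # Topic-level overrides, sorted so the output is stable across runs.
    for name, value in sorted((configs or {}).items()):
        args += ["--config", f"{name}={value}"]
    return args


cmd = create_topic_command(
    topic="orders",                      # placeholder topic name
    partitions=12,
    replication_factor=3,
    bootstrap_server="localhost:9092",   # placeholder broker address
    configs={"retention.ms": 604_800_000, "min.insync.replicas": 2},
)
printable = shlex.join(cmd)
print(printable)
```

Keeping the settings in version control and generating the command from them gives a reviewable record of partition counts, replication factors, and retention for each topic.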

Question 5

How do consumer groups handle rebalancing and what issues can it cause?

Accepted Answer

A rebalance is triggered when the consumer group membership changes (a consumer joins, leaves, or crashes) or when topic metadata changes (e.g., new partitions are added). The group coordinator broker then drives a reassignment of partitions across the current consumers.

How it works:

1. The group coordinator detects the membership change.
2. Partition assignments are revoked from all (or some) consumers, and the consumers rejoin the group.
3. Partitions are reassigned according to the partition assignor strategy. In the classic protocol, the consumer elected as group leader computes the assignment and the coordinator distributes it.
4. Consumers start consuming their assigned partitions from the last committed offsets.

Problems caused by rebalancing:

- Stop-the-world pause: with the default eager rebalancing, all consumers stop processing during reassignment.
- Duplicate processing: if offsets were not committed before the rebalance, messages get reprocessed.
- Cascading rebalances: slow consumers timing out can trigger repeated rebalances.

Mitigation strategies:

- Use CooperativeStickyAssignor to enable incremental cooperative rebalancing (only affected partitions are revoked).
- Tune session.timeout.ms and heartbeat.interval.ms to avoid false-positive failures.
- Minimize consumer group churn by using stable deployments (rolling restarts).
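The difference between eager and cooperative rebalancing can be illustrated by counting how many partitions are revoked when one consumer joins. This is a simplified model, not the real assignor: it assumes the starting assignment is already balanced, and it collapses the cooperative protocol's separate revoke and assign rounds into one step. All function names are hypothetical.

```python
def balanced_targets(num_partitions, consumers):
    """How many partitions each consumer should own when spread evenly."""
    base, extra = divmod(num_partitions, len(consumers))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(consumers)}


def eager_revocations(assignment):
    """Eager protocol: every consumer gives up all of its partitions."""
    return sum(len(parts) for parts in assignment.values())


def cooperative_rebalance(assignment, new_consumer):
    """Sticky-style sketch: only surplus partitions move to the new consumer."""
    consumers = list(assignment) + [new_consumer]
    total = sum(len(parts) for parts in assignment.values())
    targets = balanced_targets(total, consumers)

    new_assignment, moved = {}, []
    for consumer, parts in assignment.items():
        keep = targets[consumer]
        new_assignment[consumer] = parts[:keep]  # these keep being processed
        moved.extend(parts[keep:])               # only these are revoked
    new_assignment[new_consumer] = moved
    return new_assignment, len(moved)


before = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
revoked_eager = eager_revocations(before)
after, revoked_cooperative = cooperative_rebalance(before, "c3")
```

In this example the eager approach pauses all six partitions, while the cooperative sketch moves only two and leaves the other four untouched.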

Question 6

Scenario: One consumer in a group is significantly slower. What happens?

Accepted Answer

Kafka does not automatically redistribute partitions based on consumer speed. The slow consumer will fall behind, building up consumer lag on its assigned partitions, while other consumers keep up and sit idle.

What can go wrong:

- If the slow consumer exceeds max.poll.interval.ms without calling poll(), Kafka considers it dead and triggers a rebalance.
- This can cause a cascading cycle of rebalances if the consumer repeatedly times out.

Solutions:

- Scale out: add more consumer instances so each handles fewer partitions.
- Increase max.poll.interval.ms for consumers with legitimately slow processing.
- Reduce max.poll.records so each poll() returns fewer records, allowing the consumer to call poll() more frequently.
- Optimize processing logic: profile the slow consumer to find the bottleneck (DB writes, API calls, serialization).
- Use async processing: decouple polling from processing with an internal buffer/queue.
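Whether a consumer is at risk of exceeding max.poll.interval.ms is a matter of arithmetic. The sketch below assumes the worst case, where every poll() returns a full batch and records are processed one at a time; it uses the Java client defaults of 500 for max.poll.records and 300,000 ms for max.poll.interval.ms. The helper names and the 50% safety factor are illustrative choices.

```python
DEFAULT_MAX_POLL_RECORDS = 500
DEFAULT_MAX_POLL_INTERVAL_MS = 300_000


def batch_processing_ms(max_poll_records, per_record_ms):
    """Worst-case time between two poll() calls for a full batch."""
    return max_poll_records * per_record_ms


def exceeds_poll_interval(max_poll_records, per_record_ms,
                          max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS):
    """True if a full batch takes longer than the allowed poll interval."""
    return batch_processing_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


def max_safe_poll_records(per_record_ms,
                          max_poll_interval_ms=DEFAULT_MAX_POLL_INTERVAL_MS,
                          safety_factor=0.5):
    """Largest batch that fits within a fraction of the poll interval."""
    return max(1, int(max_poll_interval_ms * safety_factor // per_record_ms))


# A consumer that needs 1 second per record (e.g., a slow API call):
at_risk = exceeds_poll_interval(DEFAULT_MAX_POLL_RECORDS, per_record_ms=1_000)
fixed = exceeds_poll_interval(100, per_record_ms=1_000)
suggested = max_safe_poll_records(per_record_ms=1_000)
```

With the defaults, a full batch at 1 second per record takes 500 seconds against a 300-second limit, which is exactly the situation that triggers repeated rebalances.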

Question 7

Can Kafka guarantee exactly-once delivery? If yes, how?

Accepted Answer

Yes, Kafka supports exactly-once semantics (EOS) through a combination of three mechanisms, but it requires careful end-to-end design.

1. Idempotent Producer:

- Enabled with enable.idempotence=true.
- Kafka assigns each producer a Producer ID (PID) and tracks sequence numbers per partition.
- Duplicate messages (from retries) are detected and discarded at the broker.
- Guarantees exactly-once within a single partition from a single producer session.

2. Transactions:

- The producer wraps reads + writes in an atomic transaction using transactional.id.
- Enables atomic writes across multiple partitions and topics.
- Consumers set isolation.level=read_committed to only see committed messages.

3. Kafka Streams EOS:

- Set processing.guarantee=exactly_once_v2 in Kafka Streams.
- Internally uses transactions to atomically commit offsets and output messages together.

Important caveat: True exactly-once only holds within the Kafka ecosystem. If your downstream sink (e.g., a database) does not support idempotent writes, you can still get duplicates at the sink. Design sinks with upserts or deduplication keys.
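The idempotent producer's duplicate detection can be sketched as a sequence-number check on the broker side. This is a toy model of one partition: `ToyPartitionLog` is a hypothetical class, it tracks single records rather than batches, and the string return values stand in for the real broker responses.

```python
class ToyPartitionLog:
    """One partition that deduplicates retries using (producer ID, sequence)."""

    def __init__(self):
        self.records = []
        self._last_seq = {}  # producer ID -> last accepted sequence number

    def append(self, producer_id, sequence, value):
        last = self._last_seq.get(producer_id, -1)
        if sequence <= last:
            # Already written: a retry after a lost acknowledgment.
            return "duplicate"
        if sequence != last + 1:
            # A gap means an earlier record is missing; reject the write.
            return "out_of_order"
        self.records.append(value)
        self._last_seq[producer_id] = sequence
        return "appended"


log = ToyPartitionLog()
results = [
    log.append(producer_id=7, sequence=0, value="a"),
    log.append(producer_id=7, sequence=1, value="b"),
    log.append(producer_id=7, sequence=1, value="b"),  # retry of the same send
    log.append(producer_id=7, sequence=3, value="d"),  # sequence 2 never arrived
]
```

The retry is recognized and dropped, so the log ends up with each record once. The same idea applies at the sink: a deduplication key plays the role of the sequence number.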

Question 8

What is ISR (In-Sync Replica) and why is it critical?

Accepted Answer

The In-Sync Replica (ISR) set is the group of replicas (including the leader) that are fully caught up with the leader's log. Only replicas in the ISR are eligible to become the new leader if the current leader fails.

How it works:

- Each partition has a leader and zero or more follower replicas.
- Followers continuously fetch from the leader; if a follower falls behind by more than replica.lag.time.max.ms (default 30s), it is removed from the ISR.
- When it catches up, it is added back.

Why ISR is critical:

- Durability: with acks=all, the producer only gets an acknowledgment when all ISR replicas have written the message. A shrinking ISR means fewer replicas guarantee durability.
- Availability: if the ISR shrinks to just the leader and the leader crashes, you either lose data or face unavailability (depending on unclean.leader.election.enable).
- Consistency: only ISR members can become leaders, so acknowledged data is not lost during failover.

Key configurations:

- min.insync.replicas=2 requires at least 2 replicas in the ISR for acks=all writes to succeed.
- unclean.leader.election.enable=false prevents out-of-sync replicas from becoming leader (avoids data loss).
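The interaction between replica.lag.time.max.ms and min.insync.replicas can be modeled in a few lines. This is a simplified sketch: the broker names and timestamps are invented, the function names are hypothetical, and a follower counts as in sync if it was last caught up within the lag limit.

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # default of replica.lag.time.max.ms


def in_sync_replicas(leader, last_caught_up_ms, now_ms,
                     max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the ISR: the leader plus followers caught up recently enough."""
    isr = {leader}
    for replica, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(replica)
    return isr


def acks_all_write_allowed(isr, min_insync_replicas):
    """An acks=all write succeeds only if the ISR is large enough."""
    return len(isr) >= min_insync_replicas


now = 100_000

# b2 caught up 5 s ago; b3 has been behind for 40 s and drops out of the ISR.
healthy = in_sync_replicas("b1", {"b2": 95_000, "b3": 60_000}, now)

# Both followers are behind, so the ISR shrinks to the leader alone.
degraded = in_sync_replicas("b1", {"b2": 50_000, "b3": 60_000}, now)

healthy_ok = acks_all_write_allowed(healthy, min_insync_replicas=2)
degraded_ok = acks_all_write_allowed(degraded, min_insync_replicas=2)
```

In the degraded case the write is rejected rather than acknowledged by a single replica, which is the trade of availability for durability that min.insync.replicas=2 is meant to enforce.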