Apache Kafka Architecture Explained: Essential Components

1. Basic Definition

Apache Kafka is an open-source, distributed event streaming platform developed by LinkedIn and later donated to the Apache Software Foundation. It is designed for high-throughput, low-latency, fault-tolerant ingestion, storage, and processing of real-time data streams at scale. Kafka is built on a publish-subscribe (pub-sub) model and is widely used for building data pipelines, streaming analytics, event-driven architectures, and real-time messaging systems. It is written in Scala and Java.

2. Core Architecture & Components

Kafka’s architecture is distributed and horizontally scalable, consisting of the following key components:

Producer: Applications that publish (write) data streams (events/messages) to Kafka topics. Producers can partition data across topic partitions for load balancing and parallelism. (A minimal producer sketch in Java follows this list.)
Consumer: Applications that subscribe to (read) data streams from Kafka topics. Consumers work in consumer groups to parallelize consumption of partitioned data.
Topic: A logical category or feed name to which records are published. Topics are the core abstraction for organizing data in Kafka.
Partition: Each topic is split into one or more partitions (ordered, immutable sequences of records). Partitions enable parallelism (producers write to, and consumers read from, partitions independently) and scalability. Each record in a partition is assigned a unique offset (a sequential integer identifier).
Broker: A single Kafka server that stores topic data and handles client requests (produce/consume). A Kafka cluster consists of multiple brokers for high availability and fault tolerance.
ZooKeeper (legacy) / KRaft (modern): ZooKeeper was historically used for cluster coordination (broker leader election, metadata management, and, in early versions, consumer group offsets). KRaft (Kafka Raft metadata mode) is a built-in consensus protocol introduced in Kafka 2.8 (production-ready since 3.3) that replaces ZooKeeper, simplifying the architecture and improving scalability.
Consumer Group: A set of consumers that collaborate to consume a topic. Each partition of a topic is assigned to exactly one consumer in a group, ensuring load balancing and avoiding duplicate processing.
Offset: A unique sequential number that identifies a record's position within a partition. Consumers track their current offset so they can resume reading from where they left off after a restart.
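
The sketch below shows a minimal producer using the standard Java client (org.apache.kafka.clients.producer). The broker address, topic name, key, and value are illustrative placeholders, not part of any particular deployment:

    // Minimal producer sketch: sends one keyed record to the "orders" topic.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class OrderProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key ("order-42") is hashed to choose a partition, so all
                // records for the same order land in the same partition.
                producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}"));
            }
        }
    }

Because the key determines the partition, per-key ordering is preserved even though the topic as a whole is consumed in parallel.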

3. Core Working Principles

3.1 Data Storage Model

  • Kafka stores records in partitions as immutable logs—once a record is written to a partition, it cannot be modified or deleted (unless a retention policy is triggered).
  • Partitions are replicated across multiple brokers (replication factor) to ensure fault tolerance. One broker acts as the leader (handles produce/consume requests) for a partition, while others act as followers (replicate data from the leader and take over if the leader fails).
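
As a concrete illustration, a topic's partition count and replication factor are fixed at creation time. The sketch below uses the Java AdminClient; the topic name, counts, and broker address are placeholder values:

    // Sketch: create a topic with 3 partitions and replication factor 3.
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions for parallelism, 3 replicas for fault tolerance;
                // each partition gets one leader and two followers.
                NewTopic topic = new NewTopic("orders", 3, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }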

3.2 Publish-Subscribe Workflow

  1. Producer Sends Data: Producers send records to a topic, optionally specifying a key. If a key is provided, Kafka hashes it to pick a partition, so records with the same key always go to the same partition (preserving per-key ordering). If no key is provided, records are spread across partitions: older clients do this round-robin, while newer clients use a sticky partitioner that fills a batch for one partition before moving to the next.
  2. Broker Stores Data: The leader broker for the target partition appends the record to its log, and follower brokers replicate it. Once all in-sync replicas (ISR) have the record, it is considered committed and becomes available to consumers.
  3. Consumer Reads Data: Consumers in a group request records from the leader broker of assigned partitions, starting from their last committed offset. Consumers can commit offsets manually or automatically to track their progress.
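
A minimal consumer sketch in the same spirit: it joins a consumer group, polls records from its assigned partitions, and commits offsets manually after processing. The group id, topic, and broker address are placeholders:

    // Minimal consumer sketch: reads "orders" as part of the "billing" group
    // and commits offsets only after records have been processed.
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class OrderConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "billing");
            props.put("enable.auto.commit", "false");  // commit offsets manually
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                    consumer.commitSync();  // resume from here after a restart
                }
            }
        }
    }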

4. Key Features

  • High Throughput: Kafka can handle millions of records per second with low end-to-end latency (typically single-digit milliseconds), thanks to batching, sequential disk I/O, and zero-copy transfer.
  • Fault Tolerance: Partition replication and automatic leader election ensure that data is not lost when a broker fails; a topic with replication factor N tolerates up to N-1 broker failures without losing committed records.
  • Scalability: The cluster can be scaled horizontally by adding more brokers. Partitions can be rebalanced across brokers to handle increased load.
  • Persistence: Data is stored on disk (not just in memory), making it durable. Retention policies (time-based or size-based) control how long data is retained (e.g., 7 days, 100 GB per partition).
  • Exactly-Once Semantics: Kafka provides idempotent producers and a transactional API so that records are neither duplicated nor lost within Kafka, even across retries, network failures, or broker restarts (see the sketch after this list).
  • Stream Processing: Integrates with Kafka Streams, a lightweight library for building real-time stream processing applications (e.g., filtering, aggregating, joining data streams) directly on Kafka.
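
To illustrate the exactly-once bullet above, the sketch below enables idempotence and wraps two sends in a transaction, so consumers reading with isolation.level=read_committed see either both records or neither. The transactional.id and topic names are placeholders:

    // Sketch of the transactional producer API behind exactly-once semantics.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class TransactionalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.idempotence", "true");           // de-duplicates broker-side retries
            props.put("transactional.id", "order-service-1");  // stable id across producer restarts
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                try {
                    producer.send(new ProducerRecord<>("orders", "order-42", "created"));
                    producer.send(new ProducerRecord<>("payments", "order-42", "pending"));
                    producer.commitTransaction();  // both records become visible atomically
                } catch (Exception e) {
                    producer.abortTransaction();   // neither record is exposed to read_committed consumers
                    throw e;
                }
            }
        }
    }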

5. Typical Application Scenarios

  • Event-Driven Architecture (EDA): Decouples microservices by using Kafka as a central event bus (e.g., order service publishes an OrderCreated event; payment service, inventory service subscribe to it).
  • Data Pipelines: Ingests data from multiple sources (databases, IoT devices, logs) into data lakes or data warehouses (e.g., Kafka Connect for integrating with external systems).
  • Real-Time Analytics: Feeds real-time data to analytics tools (e.g., Apache Flink, Spark Streaming) for live dashboards, fraud detection, or user behavior analysis.
  • Log Aggregation: Collects log data from distributed systems and applications into a centralized stream for monitoring and troubleshooting.
  • Message Queuing: Replaces traditional message queues (e.g., RabbitMQ) for high-throughput, persistent messaging with better scalability.

6. Key Ecosystem Tools

Schema Registry: Manages Avro, Protobuf, or JSON schemas for Kafka records, ensuring data compatibility between producers and consumers.

Kafka Connect: A framework for connecting Kafka to external systems (databases, cloud storage, SaaS platforms) with pre-built connectors (no custom code required).

Kafka Streams: A client library for building stream processing applications that process data directly within Kafka, without needing separate clusters.
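
A small Kafka Streams sketch, maintaining a running count of events per key from one topic into another. The application id and topic names are placeholders:

    // Kafka Streams sketch: running count of page views per key.
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import java.util.Properties;

    public class PageViewCounts {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> views = builder.stream("page-views");
            views.groupByKey()
                 .count()                      // running count per key, backed by a state store
                 .toStream()
                 .to("page-view-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }

The topology runs inside an ordinary JVM process; scaling out simply means starting more instances with the same application id.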

Confluent Platform: A commercial distribution of Kafka that includes additional tools (Schema Registry, ksqlDB) for schema management and SQL-based stream processing.


