Apache Kafka, a Distributed Streaming Platform

Apache Kafka is a distributed stream processing platform that was initially developed at LinkedIn and open sourced in 2011. It is used for building real-time data pipelines and streaming applications. Today, Apache Kafka is used by many companies, including Microsoft, Netflix, Airbnb, the New York Times, Goldman Sachs, Target, Line, LinkedIn, Intuit, eBay, Walmart, Shopify, PayPal, Yelp, Uber, and Betfair.


Kafka is written in Java and Scala and is often deployed alongside the Hadoop ecosystem. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. It functions like a publish/subscribe messaging system but adds fault tolerance, replication, built-in partitioning, and high throughput. These features make Kafka a highly attractive option for data integration. Kafka is frequently used together with Spark Streaming, Apache Storm, and Apache Hadoop.
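Built-in partitioning means each record is routed to one partition of a topic, typically by hashing the record's key, so records with the same key keep their relative order. Here is a minimal sketch of that idea in Python; the function name `partition_for` is illustrative, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner actually uses:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a keyed record. Records with the same key
    always map to the same partition, preserving their order."""
    # Kafka's default partitioner uses murmur2; CRC32 is used here
    # only to keep the sketch dependency-free.
    return zlib.crc32(key) % num_partitions

# Records sharing a key are routed to the same partition:
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
```

Because the mapping is deterministic, per-key ordering survives even though the topic as a whole is spread across many partitions and brokers.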


As a streaming platform, Kafka has 3 key capabilities:


  • It allows users to publish and subscribe to streams of records, like a messaging system.
  • It allows users to store streams of records in a fault-tolerant way in a distributed, replicated cluster.
  • It allows users to process streams of records in real time with scalable stream processing applications.
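The first two capabilities come from Kafka's core abstraction: a partition is an append-only log, and each consumer tracks its own offset into that log, so stored records can be re-read at any time. A toy in-memory sketch (the class name `TopicPartition` and its methods are illustrative, not Kafka's API):

```python
class TopicPartition:
    """Toy append-only log: producers append, consumers read from an offset."""

    def __init__(self):
        self._log = []

    def append(self, record):
        """Publish a record; return the offset it was written at."""
        self._log.append(record)
        return len(self._log) - 1

    def read(self, offset):
        """Return all records at and after `offset` (supports replay)."""
        return self._log[offset:]

log = TopicPartition()
log.append("a")
log.append("b")
off = log.append("c")
print(log.read(0))    # a consumer replaying from the start: ['a', 'b', 'c']
print(log.read(off))  # a consumer resuming at its saved offset: ['c']
```

Real Kafka adds replication, retention policies, and durable consumer offsets on top of this log model, but offset-based reads are why multiple independent consumers can process the same stored stream at their own pace.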


The platform exposes this publish/subscribe messaging system through 5 core APIs:


  • The Producer API enables an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API enables applications to subscribe to topics in a Kafka cluster and read the streams of records produced to them.
  • The Streams API makes it possible to transform input streams of records into output streams.
  • The AdminClient API allows managing and inspecting Kafka objects such as topics, brokers, and ACLs.
  • The Connector API enables developing and running reusable producers or consumers that connect Kafka topics to existing data systems or applications.
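The Streams API is essentially a pipeline of per-record operations (map, filter, and so on) from input topics to output topics. The sketch below mimics that shape in plain Python over in-memory records; the function `transform_stream` and the tombstone-filtering rule are illustrative assumptions, not Kafka's actual Streams API:

```python
def transform_stream(records):
    """Mimic a simple Streams topology: filter out records with no
    value (tombstones), then map each remaining value to uppercase."""
    for key, value in records:
        if value is not None:           # filter step
            yield key, value.upper()    # map step

input_records = [("k1", "hello"), ("k2", None), ("k3", "kafka")]
output_records = list(transform_stream(input_records))
print(output_records)  # [('k1', 'HELLO'), ('k3', 'KAFKA')]
```

In real Kafka Streams the same logic would be declared as a topology reading from and writing to topics, with the framework handling partitioning, state, and fault tolerance.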


Kafka is also commonly used for log aggregation, website activity tracking, stream processing, and operational metrics.