Hello Kafka: Getting Acquainted With Data Streaming
Kafka is a big data streaming platform used by many enterprises to:
- Provide streams of records, similar to a message queue or enterprise messaging system, to which clients may publish data or subscribe.
- Store streams of records in a fault-tolerant, durable way so that clients may consume the streams and take action on the data.
Since its introduction at LinkedIn in 2011, Kafka has become an essential part of the Big Data landscape. Through the use of its "topics" and ecosystem of components, Kafka provides a uniform way to move data between systems, while providing the foundation for stream analytics and building applications that are able to react to data as it becomes available.
In this article, we will introduce the fundamental components of Kafka and show how to create topics, populate them with sample data, and consume the resulting streams. This article is part of a series that uses a Docker-based Big Data Laboratory that can be deployed using docker-compose. Please see Part 1 for an overview of the environment and deployment instructions.
Kafka Essentials
Kafka is a messaging system that can be used to build data streams. It manages the flow of data and ensures that data is delivered to where it is intended to go.
- Data streams can operate upon explicit types of data: things like patients, orders, diagnoses, and treatments in healthcare; or things like sales, shipments, and returns in general business. Kafka is not limited to that, however; it can also be used to model the actions and events that produce those objects and results.
- Treating data as a stream allows the creation of applications that process and respond to specific events, and in the process, trigger other important actions.
Kafka Capabilities
Kafka provides two main use cases which make it an "application hub":
Stream Processing
Kafka enables continuous, real-time applications to react to, process and transform streams.
Data Integration
Kafka captures events and feeds those to other data systems including big data processing engines such as Spark, NoSQL (key/value systems or document stores), Object Storage (like MinIO, Ceph, Amazon S3, and OpenStack Swift), and relational databases (PostgreSQL and MySQL).
To make it effective as a streaming platform, Kafka provides:
- Real-time publication/subscription at large scale. Kafka's implementation allows for low latency, making it possible for real-time applications to leverage Kafka for time-sensitive data and still operate with high throughput.
- Capabilities for processing and storing data. Kafka differs slightly from other messaging applications, like RabbitMQ, in that it stores data for a period of time. Persisting the data allows it to be replayed, as well as integrated with batch systems like Hadoop or Spark.
Kafka organizes its records into streams, each maintained as a commit log and published as a "topic".
- A topic is multi-subscriber and can be consumed by one application or many.
- "Consumers" are responsible for tracking their position, which allows for different types of consumers to process messages at their own pace.
- Consumers can be organized into "Groups" to allow for horizontal scalability of the same type of consumer.
- Each piece of data (a record) that is published to Kafka consists of a key, a value, and a timestamp. The value provided can be almost anything including semi-structured data such as JSON or Avro, or unstructured data such as plain text.
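As a rough illustration (the names and values below are made up for this example, and the record is actually stored in Kafka's binary wire format), a single record might carry:
key: "patient-1234"
value: {"event": "admission", "ward": "ICU"}
timestamp: 1620000000000 (milliseconds since the Unix epoch)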
Applications access event streams by connecting to "Brokers," which store the data and respond to access requests from clients (consumers).
Applications which connect to Kafka are generally categorized into Producers and Consumers:
- Producers add messages and data to a topic
- Consumers subscribe to topics and process messages
Kafka CLI Utilities
The tools and utilities in this lab are included as part of the Jupyter container in the Big Data lab. You can access them by opening a terminal instance. Terminals are opened by clicking on the "Terminal" option under the "Other" menu.
Throughout this lab, you will need to have multiple terminals open. In one terminal you will type commands to submit data, while in the other terminal you will consume data using a separate program.
Jupyter allows you to stack multiple terminals on top of one another (or side-by-side) after you have created them by clicking and dragging the terminal's tab.
All of the Kafka CLI utilities used in this article can be found in /opt/kafka/bin and include:
- /opt/kafka/bin/kafka-topics.sh: the command line utility used to manage topics and their configuration
- /opt/kafka/bin/kafka-console-producer.sh: a CLI-based producer that allows you to publish data to a Kafka topic for testing or debugging
- /opt/kafka/bin/kafka-console-consumer.sh: a CLI-based consumer that allows you to consume the messages of a topic for testing or debugging
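Running any of these scripts without arguments (or, on more recent Kafka releases, with --help) prints a usage summary listing the options it accepts, which is a convenient way to explore them. For example:
$ /opt/kafka/bin/kafka-topics.sh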
Producers and Consumers
1. Inspect Existing Topics
In the terminal, view all the topics currently defined in the cluster using kafka-topics.sh:
$ /opt/kafka/bin/kafka-topics.sh --list --zookeeper zookeeper:2181
2. Create a New Topic
Using kafka-topics.sh, create a topic that will be used for the remainder of the lab called test.
$ /opt/kafka/bin/kafka-topics.sh --create --topic test --partitions 1 \
    --replication-factor 1 --if-not-exists --zookeeper zookeeper:2181
The --replication-factor 1 option tells Kafka that the topic has a redundancy of one (which is to say, no redundancy at all). In a production system where retention of data is important, you would want a redundancy of two or three.
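If you would like to confirm how the topic was configured, kafka-topics.sh also has a --describe option that reports the partition count, replication factor, and the broker hosting each partition. Assuming the same ZooKeeper address used in the lab:
$ /opt/kafka/bin/kafka-topics.sh --describe --topic test --zookeeper zookeeper:2181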
After the topic has been created successfully, ensure that it appears in the list by using the kafka-topics.sh --list command from above.
3. Consume Messages from a Topic
The kafka-console-consumer.sh tool provides a number of options to allow you to connect to a topic and inspect the messages. From the terminal session, start a kafka-console-consumer.sh session listening to the test topic you created earlier.
$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
    --topic test --from-beginning
In the connection options, we provide a --bootstrap-server which is used by the client to contact the cluster. The cluster will then help the client determine where the topic partitions it should be listening to are located and will facilitate the connections.
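The console consumer can also join a consumer group (described earlier) so that several instances divide the partitions of a topic between them. As a sketch using the lab's broker address (the group name here is just an example):
$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
    --topic test --group test-group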
4. Publish Messages to a Topic
Start a kafka-console-producer.sh instance. Once the kafka-console-producer.sh instance starts, you will see an angle bracket prompt (>):
$ /opt/kafka/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic test
>
kafka-console-producer.sh can be used to send plain text messages to the --topic specified.
At the prompt, enter a few messages:
> "Hello Kafka! This is my first message!" > {"hello": "world", "payload": "This is a JSON blob!"} > <hello-world><msg>Hello world! This is an XML document!</msg></hello-world>
Each message is sent when you press the "Enter" key.
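The console producer can also attach a key to each message. As a sketch (the separator character and sample input below are only examples), the parse.key and key.separator properties tell the tool to split each input line into a key and a value:
$ /opt/kafka/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic test \
    --property parse.key=true --property key.separator=:
> patient-1234:{"event": "admission"}
The matching print.key=true property on kafka-console-consumer.sh will display the key alongside each value.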
From the first window (kafka-console-consumer.sh), watch the messages as they are processed. Stop the consumer instance and experiment with the --offset and --partition options to explore the different ways in which messages can be consumed.
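For example, assuming the single-partition test topic created earlier, the following invocation reads from partition 0 starting at offset 2 (note that --offset must be paired with --partition):
$ /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 \
    --topic test --partition 0 --offset 2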