Large companies collect raw data from users and from multiple other sources.
Most of the time, this data is unorganized or stored in an unusable format. Businesses face an ongoing challenge to sift through their data stores, structure the raw data, and derive insights from it.
Building a real-time data streaming platform with Node.js and Kafka is an effective way to ingest source data, process it, and clean it up. The insights extracted from this data help drive informed business decisions and improve organizational performance.
This post explores how to build a real-time data streaming platform with Node.js and Kafka.
Prerequisites
To follow along with our Node.js real-time data streaming tutorial, here are the prerequisites:
- Recent versions of Node.js, npm, and a JVM installed on your target machine
- Kafka installed on your local machine
- Working knowledge of writing Node.js applications
Repository: tulios/kafkajs (the KafkaJS client library for Node.js used in this tutorial)
Basic Terminology
Here are a few basic terms you should familiarize yourself with before we get started.
- Zookeeper
ZooKeeper manages leader elections and the Kafka cluster's metadata. It also keeps track of partition state and synchronizes configuration across Kafka servers and brokers. The official documentation provides more information about this.
- Topics
Kafka topics act as intermediate storage for streamed data within a cluster. You can set the replication factor and other parameters for each Kafka topic; a short example of creating a topic programmatically follows this list.
- Producers, Consumers, and Clusters
Producers publish data to Kafka brokers, and consumers read data from them. A cluster is the group of servers that powers a Kafka instance.
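For example, here is a minimal sketch of creating a topic programmatically with the kafkajs admin client; the topic name, partition count, and broker address are assumptions for illustration:

const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'admin-client', brokers: ['localhost:9092'] });

const createTopic = async () => {
  const admin = kafka.admin();
  await admin.connect();
  // create a topic with 3 partitions and a replication factor of 1
  await admin.createTopics({
    topics: [{ topic: 'my-topic', numPartitions: 3, replicationFactor: 1 }],
  });
  await admin.disconnect();
};

createTopic().catch(console.error);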
How to Install Kafka
The installation process for Kafka is simple. All you have to do is download the binaries from the official Apache Kafka downloads page and extract the archive. Enter the following commands in the terminal:
cd <location-of-downloaded-kafka-binary>
tar -xzf <downloaded-kafka-binary>
cd <name-of-extracted-kafka-directory>
The tar command extracts the archive you downloaded. Navigate into the extracted directory and look for the bin/ folder, which contains the start-up scripts (such as kafka-server-start.sh), and the config/ folder, which contains configuration files (such as server.properties).
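If you want to try Kafka straight from these binaries (later in this tutorial we run it in Docker instead), you can start ZooKeeper and then the broker with the bundled scripts, each in its own terminal:

# terminal 1: start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# terminal 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties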
Your Kafka installation is now ready. We can now proceed with our Node.js and Kafka data streaming tutorial.
How to Set Up a Basic Node.js Data Stream
Here's a simple code snippet to help you set up a basic Node.js data stream. It emits a sequence of numbers and writes them to standard output:
const { Readable } = require('stream');

// Readable stream that emits the numbers 1..max, one number per chunk
class NumberStream extends Readable {
  constructor(options) {
    super(options);
    this.max = options.max || 100;
    this.current = 1;
  }

  _read() {
    // signal end-of-stream once we pass the maximum
    if (this.current > this.max) {
      this.push(null);
      return;
    }
    this.push(`${this.current}\n`);
    this.current++;
  }
}

const numberStream = new NumberStream({ max: 10 });
numberStream.pipe(process.stdout);
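If you want to process values as they flow, you can pipe the readable stream through a Transform stream. Here is a minimal sketch that builds on the NumberStream class above; the doubling logic is just an illustrative placeholder for whatever processing you need:

const { Transform } = require('stream');

// Transform stream that doubles each incoming number before passing it on
const doubler = new Transform({
  transform(chunk, encoding, callback) {
    const doubled = chunk
      .toString()
      .split('\n')
      .filter(Boolean)
      .map((n) => Number(n) * 2)
      .join('\n');
    callback(null, doubled + '\n');
  },
});

// NumberStream is the class defined in the previous snippet
new NumberStream({ max: 10 }).pipe(doubler).pipe(process.stdout);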
How to Create a Simple Kafka Processor
You can create a simple Kafka processor and use it with Node.js to process data from Kafka topics and perform transformations.
Here is a sample code snippet that you can use to convert all incoming messages to uppercase:
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'my-stream-processor',
  brokers: ['broker1:9092', 'broker2:9092'],
});

const streamProcessing = async () => {
  const consumer = kafka.consumer({ groupId: 'my-group' });
  const producer = kafka.producer();

  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'input-topic' });

  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      // transform each incoming message and forward it to the output topic
      const originalValue = message.value.toString();
      const transformedValue = originalValue.toUpperCase();
      await producer.send({
        topic: 'output-topic',
        messages: [{ value: transformedValue }],
      });
    },
  });
};

streamProcessing().catch(console.error);
Kafka and Node.js Real-time Data Streaming Tutorial (Step by Step)
Below are the steps for building a real-time data streaming platform using Kafka and Node.js:
- Write a YAML file and run Kafka in a container. To do that, create a docker-compose.yaml file with the following content:
version: "3"
services:
  kafka:
    image: "bitnami/kafka:latest"
    container_name: "kafka"
    ports:
      - "9092:9092"
    environment:
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://127.0.0.1:9092
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_LISTENERS=PLAINTEXT://:9092
      - ALLOW_PLAINTEXT_LISTENER=yes
    depends_on:
      - zookeeper
  zookeeper:
    image: "bitnami/zookeeper:latest"
    container_name: "zookeeper"
    ports:
      - "2181:2181"
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
- We will need to run two containers: Kafka and ZooKeeper. ZooKeeper will manage leader elections, partition assignments, and Kafka topics; it will also store topic configurations and permissions. Make sure Docker is set up correctly, go to the project's root directory, and run the following command:
docker-compose up
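To confirm that both containers are up before moving on, you can check from the same directory (the exact output depends on your Docker setup):

docker-compose ps
# inspect the broker's startup logs if something looks off
docker logs kafka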
- Initialize the Node.js project with the following command:
npm init --yes
- Install Kafka dependencies by typing:
npm install kafkajs
- Add two scripts to the package.json file:
"scripts": {
"start:consumer": "node consumer.js",
"start:producer": "node producer.js"
}
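Because the producer and consumer files below use ES module import syntax, the package.json also needs "type": "module" (or the files must use the .mjs extension). A minimal package.json might look like this; the package name, version, and kafkajs version are placeholders:

{
  "name": "kafka-streaming-demo",
  "version": "1.0.0",
  "type": "module",
  "scripts": {
    "start:consumer": "node consumer.js",
    "start:producer": "node producer.js"
  },
  "dependencies": {
    "kafkajs": "^2.2.0"
  }
}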
- Create two server files named producer.js and consumer.js. Make sure Kafka is still running in the container.
- Create a producer server using the following code:
import { Kafka, Partitioners } from "kafkajs"

const kafka = new Kafka({
  clientId: "my-producer",
  brokers: ["localhost:9092"],
})

const producer = kafka.producer({
  createPartitioner: Partitioners.DefaultPartitioner,
})

const produce = async () => {
  await producer.connect()
  await producer.send({
    topic: "my-topic",
    messages: [
      {
        value: "Hello From Producer!",
      },
    ],
  })
}

// produce after every 3 seconds
setInterval(() => {
  produce()
    .then(() => console.log("Message Produced!"))
    .catch(console.error)
}, 3000)
While creating a producer, use Partitioners.DefaultPartitioner to set up a default partitioner.
Import the Kafka class and use its constructor to connect to the Kafka broker running in the container. Use the send() method to pass an object containing the message and the Kafka topic name. Use the setInterval() method to produce a message every 3 seconds so that the consumer can pick it up.
- The final step is to create a consumer server:
import { Kafka } from "kafkajs"

const kafka = new Kafka({
  clientId: "my-consumer",
  brokers: ["localhost:9092"],
  // don't log anything
  logLevel: 0,
})

const consumer = kafka.consumer({ groupId: "my-group" })

const consume = async () => {
  await consumer.connect()
  await consumer.subscribe({ topic: "my-topic", fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log({
        value: message.value.toString(),
      })
    },
  })
}

// start consuming
consume().catch(console.error)
The client setup is the same as before; however, there are a few important differences.
We pass a groupId to place the consumer in a named consumer group; a fixed, descriptive name makes the group easy to recognize later.
Subscribing to the chosen Kafka topic after connecting is crucial; it lets us listen for messages as they are produced. Setting fromBeginning to true makes the consumer start reading from the topic's earliest offset, so it consumes all existing messages as well as new ones.
We can run both servers using the scripts in the package.json file.
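For example, in two separate terminals (starting the consumer first so it is ready when messages arrive):

# terminal 1
npm run start:consumer

# terminal 2
npm run start:producer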
As a result, the producer logs "Message Produced!" every three seconds, while the consumer prints each message value (for example, { value: 'Hello From Producer!' }) as it arrives.
Best Practices for Building a Streaming Data Platform with Data Visualization and Analytics
1. Error Handling and Retry Mechanisms
Implementing robust error handling and retry mechanisms can prevent service interruptions and errors and ensure data streaming continuity. This is especially useful in building real-time data streaming platforms for e-commerce businesses.
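As a rough sketch, kafkajs lets you tune connection and request retries when you create the client, and you can wrap your own processing logic in a try/catch to route failed messages to a dead-letter topic. The retry values, topic names, and handleMessage function below are illustrative assumptions, not recommendations:

import { Kafka } from "kafkajs"

const kafka = new Kafka({
  clientId: "resilient-processor",
  brokers: ["localhost:9092"],
  // broker connection/request retries are handled by kafkajs itself
  retry: {
    initialRetryTime: 300,
    retries: 8,
  },
})

const consumer = kafka.consumer({ groupId: "resilient-group" })
const producer = kafka.producer()

// stand-in for real business logic; may throw on bad input
const handleMessage = (value) => {
  if (!value) throw new Error("empty message")
  console.log("processed:", value)
}

const run = async () => {
  await consumer.connect()
  await producer.connect()
  await consumer.subscribe({ topic: "input-topic" })

  await consumer.run({
    eachMessage: async ({ message }) => {
      try {
        handleMessage(message.value.toString())
      } catch (err) {
        // park the failed message on a dead-letter topic instead of crashing
        await producer.send({
          topic: "input-topic-dlq",
          messages: [{ value: message.value }],
        })
      }
    },
  })
}

run().catch(console.error)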
2. Monitoring and Logging
Gain visibility into the health and performance of your data streaming platform and application. Integrate monitoring and logging tools to track data flows, spot anomalies, and troubleshoot critical issues. Grafana and Prometheus are widely used, industry-standard tools for metrics collection and visualization and can provide valuable insight into the behavior of your data streaming pipeline.
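As a starting point, you can count processed messages in the consumer and expose them on a /metrics endpoint for Prometheus to scrape, then chart them in Grafana. Here is a rough sketch using the prom-client package; the metric name, port, and topic are arbitrary choices for illustration:

import http from "http"
import client from "prom-client"
import { Kafka } from "kafkajs"

// counter incremented for every message the consumer handles
const processedMessages = new client.Counter({
  name: "processed_messages_total",
  help: "Total number of Kafka messages processed",
})

const kafka = new Kafka({ clientId: "monitored-consumer", brokers: ["localhost:9092"] })
const consumer = kafka.consumer({ groupId: "monitoring-group" })

const run = async () => {
  await consumer.connect()
  await consumer.subscribe({ topic: "my-topic", fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ message }) => {
      processedMessages.inc()
      console.log(message.value.toString())
    },
  })
}

// expose metrics for Prometheus at http://localhost:9464/metrics
http
  .createServer(async (req, res) => {
    res.setHeader("Content-Type", client.register.contentType)
    res.end(await client.register.metrics())
  })
  .listen(9464)

run().catch(console.error)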
3. Test and Mock Streams
To optimize your data streaming architecture, use techniques like simulating data events, mocking streams, and conducting end-to-end tests. For testing purposes, you can use libraries like stream-mock.
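For a quick sanity check without a running broker, you can feed fake messages through the same transformation your consumer applies. Here is a minimal sketch built on Node's built-in stream module; the toUpperCase handler and the sample messages are assumptions for illustration:

import { Readable } from "stream"

// the same transformation logic the real consumer would apply
const toUpperCase = (value) => value.toUpperCase()

// simulate a stream of Kafka-like messages without a broker
const mockMessages = Readable.from([
  { value: Buffer.from("hello") },
  { value: Buffer.from("world") },
])

const run = async () => {
  for await (const message of mockMessages) {
    const result = toUpperCase(message.value.toString())
    console.assert(result === "HELLO" || result === "WORLD", "unexpected output")
    console.log(result)
  }
}

run().catch(console.error)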
Use Cases of Data Streaming with Node.js and Kafka
1. Real-Time Data Analytics
Node.js's event-driven architecture and support for asynchronous programming make it ideal for building real-time analytics applications. You can monitor website user activity, track sensor data across IoT devices, and analyze social media trends. Combining Node.js and Kafka, you can build a data streaming platform for social media channels, collect insights, and feed real-time analytics dashboards for further analysis.
2. Fraud Detection
Another use case is banking and financial services. A real-time data streaming platform built with Node.js and Kafka can analyze transaction data and borrower behavior and assess creditworthiness. It helps prevent fraudulent transactions and defaults and lets organizations respond to suspicious activity in real time. Node.js's data processing capabilities make it well suited to fraud detection pipelines and financial data streaming applications.
3. Social Media Marketing
You can leverage Kafka and Node.js's real-time data capabilities for sentiment analysis in your social media marketing. Social media marketing involves analyzing hashtags, mentions, and followers and finding collaborators for upcoming posts. Building a real-time data streaming platform with Node.js and Kafka for social media applications can enhance the performance of marketing campaigns and enable companies to respond swiftly to emerging trends and discussions.
Conclusion
In this tutorial, we showed you how to build a data streaming platform using Kafka and Node.js. You should now be able to design event-driven architectures and applications with it. Apache Kafka is an excellent choice for building data ingestion and processing solutions: it is reliable, scalable, and can handle large volumes of data in real time.
If you want to hire Node.js developers for your projects, contact Clarion Technologies today.