Building Scalable Streaming Applications

DataStax has recently released its Astra Streaming, enabling developers to build streaming applications on top of an elastically scalable, multi-cloud messaging and event streaming platform powered by Apache Pulsar. This article will walk you through a short demo that will provide a great starting point for familiarizing yourself with this powerful new streaming service.

Here’s what you will learn:

Why Pulsar Beats Kafka for a Scalable, Distributed Data Architecture

The leading open-source event streaming platforms are Apache Kafka and Apache Pulsar. For enterprise architects and application developers, choosing the right event streaming approach is critical, as these technologies will help their apps scale up around data to support operations in production.

Everyone wants results faster. We want applications that know what we want, even before we know ourselves. We want systems that constantly check for fraud or security issues to protect our data. We want applications that are smart enough to react and change plans when faced with the unexpected. And we want those services to be continuously available.

API Security Weekly: Issue 171

This week, we have news of multiple API flaws and vulnerabilities: the parcel tracking portal at DPD that may have exposed customer data, an API vulnerability in the Apache Pulsar that allowed access data in different tenants, and an SQL injection vulnerability in Casdoor API. On the more positive side, we take a look at the emerging trends in the API industry.

Vulnerability: DPD Parcel Tracking Flaw May Have Exposed Customer Data

The big news this week was the disclosure of a vulnerability in the parcel tracking portal of DPG Group, which may have exposed customer data.

Bring Streaming to Apache Cassandra with Apache Pulsar

Twitch, YouTube, Instagram, Facebook — virtually every major brand nowadays uses live streaming to connect and engage their audience. For enterprises and developers building cloud-native applications, this growing trend creates a need for streaming technologies that can reliably handle the rush of massive amounts of data, while also being flexible and easy to manage for developers.

One such technology is Apache Pulsar® — an open-source, distributed messaging and streaming platform that’s easy to deploy, simple to scale, and packed with developer-friendly APIs. So the next question is: how can you stream from Pulsar to Apache Cassandra®, the powerful NoSQL database designed to support data-heavy applications in the cloud?

Join our beginner-friendly Pulsar workshop on YouTube and learn how to connect Pulsar with Cassandra for streaming! In this post, we’ll set the scene with an introduction to Pulsar and guide you through four hands-on exercises where you’ll use these free, cloud-native technologies: Katacoda, Kesque, GitPod, and DataStax Astra DB. Each exercise will also be linked to the step-by-step instructions on the DataStax Developers GitHub wiki.

7 Reasons to Choose Apache Pulsar over Apache Kafka

So why did we build our messaging service using Apache Pulsar?

At DataStax, our mission is to empower developers to build cloud-native distributed applications by making cloud-agnostic, high-performance messaging technology easily available to everyone. Developers want to write distributed applications or microservices but don’t want the hassle of managing complex message infrastructure or getting locked into a particular cloud vendor. They need a solution that just works. Everywhere.

Real-Time Pulsar and Python Apps on a Pi

Today we will look at the easy way to build Python streaming applications from the edge to the cloud. Let's walk through how to build a Python application on a Raspberry Pi that streams sensor data and more from the edge to any and all data stores while processing data in event time.

My GitHub repository has all of the code, configuration, and scripts needed to build and run this application.

Streaming Real-Time Chat Messages Into Scylla With Apache Pulsar

At Scylla Summit 2022, I presented "FLiP Into Apache Pulsar Apps with ScyllaDB". Using the same content, in this blog, we'll demonstrate step-by-step how to build real-time messaging and streaming applications using a variety of OSS libraries, schemas, languages, frameworks, and tools utilizing ScyllaDB. We'll also introduce options from MQTT, Web Sockets, Java, Golang, Python, NodeJS, Apache NiFi, Kafka on Pulsar, Pulsar protocol, and more. You will learn how to quickly deploy an app to a production cloud cluster with StreamNative, and build your own fast applications using the Apache Pulsar and Scylla integration.

Before we jump into the how, let's review why this integration can be used for a speedy application build. Scylla is an ultra-fast, low-latency, high-throughput, open-source NoSQL platform that is fully compatible with Cassandra. Populating Scylla tables utilizing the Scylla-compatible Pulsar IO sink doesn't require any complex or specialized coding, and the sink makes it easy to load data to Scylla using a simple configuration file pointing to Pulsar topics that stream all events directly to Scylla tables.

Deploying AI With an Event-Driven Platform

This is an article from DZone's 2022 Enterprise AI Trend Report.

For more:


Read the Report

Today, many large organizations are deploying artificial intelligence (AI) models with an event-driven platform in order to solve two common challenges of leveraging enterprise AI. First, to meet their data needs, enterprises often require a variety of model types that are built on different machine learning (ML), deep learning, and AI languages, frameworks, tools, and systems. These models are tied to various ways of deployment, using tools such as PyTorch, scikit-learn, XGBoost, DJL.AI, spaCy, TensorFlow, ONNX, PMML, Apache MXNet, and H2O. As a result, developers and data engineers need to deploy their models in diverse deployment environments with varying characteristics and restrictions, which makes accessing and managing the models complicated. 

Generating Simulated Streaming Data

For demos, system tests, and other purposes, it is good to have a way to easily produce realistic data at scale utilizing a schema of our own choice.

Fortunately, there is a great library for Python called Faker that lets us build synthetic data for tests. With a simple loop and a Pulsar produce call, we can send messages to topics at scale.

Pulsar in Python on Pi for Sensors

I have a new Raspberry Pi with a Breakout Garden with a thermal camera, 1.12" OLED screen, and a CO2+ sensor.

We first need to install the Pulsar Python Client, if you are running on certain architectures you will need to compile the Apache Pulsar C++ Client first.

Pulsar on KubeSphere: Installing Distributed Messaging and Streaming Platform

KubeSphere, an open-source container platform running on Kubernetes, provides users with an app-centric experience. In this connection, it features a comprehensive set of tools for developers to manage apps across their entire lifecycle. In this article, I will demonstrate how to install Apache Pulsar on a KubeSphere cluster as an example. Apache Pulsar, a cloud-native, distributed messaging and streaming tool, represents a go-to platform to meet the real-time event-streaming needs of enterprises.

Before You Begin

To install Pulsar on KubeSphere, you need to do the following beforehand:

Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

When DataStax started investing in streaming with Apache Pulsar™, we knew that one of the first things people would want to do was connect existing enterprise data sources to Apache Cassandra™ using Pulsar.

Apache Pulsar has a powerful framework called Pulsar IO to enable this kind of use case, and at DataStax we already had a best-in-class Kafka Connect Sink that enables you to store structured data coming from one or more Kafka topics into DataStax Enterprise, Apache Cassandra, and Astra.

Simplify Migrating From Kafka to Pulsar With Kafka Connect Support

Large-scale implementations of any system, such as the event-streaming platform Apache Kafka, often involve customizations and tools and plugins developed in-house. When it’s time to transition from one system to another, the task can become complicated, drawn-out, and error-prone. Often the benefits of an alternative system (which can include significant cost savings and other efficiencies) are outweighed by the risks and costs of migration. As a result, an organization can end up locked into a suboptimal situation, footing a bigger bill than necessary and missing out on modern features that help move the business forward faster. 

These risks and costs can be mitigated by making the transition process iterative, breaking off the vendor lock-in in small, manageable steps, and avoiding the "big bang" switch that often results in delayed delivery and increases the cost of running two systems in parallel for A|B testing. 

Fast JMS for Apache Pulsar: Modernize and Reduce Costs with Blazing Performance

Written by: Chris Bartholomew

DataStax recently announced the availability of Fast JMS for Apache Pulsar, a JMS 2.0 API. By combining the industry-standard Java Messaging Service (JMS) API with the cloud-native and horizontally scalable Apache Pulsar™ streaming platform, DataStax is providing a powerful way to modernize your JMS infrastructure, improve performance, and reduce costs. Fast JMS is open source and is included in DataStax’s Luna Streaming Enterprise support of Apache Pulsar.

10 Reasons to Choose Apache Pulsar Over Apache Kafka

Today, many data architects, engineers, dev-ops, and business leaders are struggling to understand the pros and cons of Apache Pulsar and Apache Kafka. As someone who has worked with Kafka in the past, I wanted to compare these two technologies. 

If you are looking for insights on when to use Pulsar, here are 10 advantages of the technology that might be the deciding factors for you.

Looking Under the Hood of Apache Pulsar: How Good is the Code?

Apache Pulsar (incubating) is an enterprise-grade publish-subscribe (aka pub-sub) messaging system that was originally developed at Yahoo. Pulsar was first open-sourced in late 2016, and is now undergoing incubation under the auspices of the Apache Software Foundation. At Yahoo, Pulsar has been in production for over three years, powering major applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, the Gemini Ads platform, and Sherpa, Yahoo’s distributed key-value store.

One of the primary goals of the open-source software movement is to allow all users to freely modify software, use it in new ways, integrate it into a larger project or derive something new based on the original. The larger idea being that through open collaboration, we can make a software the best version of itself. So, in the spirit of innovation and improvement, we peeked behind the curtain and ran Apache Pulsar through the free static code analysis tool Embold. The results are very interesting, both from an observer's and a learner's point of view.

5 More Reasons to Choose Apache Pulsar Over Kafka

Awhile back I wrote a post about the 7 Reasons We Choose Apache Pulsar over Apache Kafka. Since then, I have been working on a detailed report comparing Kafka and Pulsar, talking to users of the open-source Pulsar project, and talking to users of our managed Pulsar service, Kafkaesque. What I’ve realized that I missed some reasons in that first post. So, I thought I’d do a follow-up post that adds to the list.

Before diving into the new reasons, let’s quickly recap the seven reasons mentioned in the previous post:

Life Beyond Kafka With Apache Pulsar


During all my years as a Solution Architect, I have built many streaming architectures, such as real-time data ETL, reactive microservices, log collection, and even AI-driven services, all using Kafka as a core part of their architecture. Kafka is a proven stream-processing platform used for many years at companies like LinkedIn, Microsoft, and Netflix. In many cases Kafka works very well, supports large amounts of data, and has a good community. Because of that, Kafka is used for many streaming scenarios.

However, due to the design of Kafka, all of my projects using Kafka have been suffering similar problems: