How to Choose a Stream Processor for Your Data

Data has become integral to most organizations, so it's no wonder that stream processing has become a critical part of big data stacks. Stream processing works well for consolidating and interpreting large amounts of data as it arrives.

There are many end-to-end solutions available for streaming data pipelines in the cloud, and plenty of terminology to navigate when choosing among the different stream processing tools.

Fast JMS for Apache Pulsar: Modernize and Reduce Costs with Blazing Performance

Written by: Chris Bartholomew

DataStax recently announced the availability of Fast JMS for Apache Pulsar, an implementation of the JMS 2.0 API. By combining the industry-standard Java Message Service (JMS) API with the cloud-native and horizontally scalable Apache Pulsar™ streaming platform, DataStax is providing a powerful way to modernize your JMS infrastructure, improve performance, and reduce costs. Fast JMS is open source and is included in DataStax’s Luna Streaming Enterprise support for Apache Pulsar.
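
To make the JMS side of this concrete, here is a minimal sketch of producing and consuming a message through the JMS 2.0 simplified API. It assumes the Pulsar JMS client exposes a standard javax.jms.ConnectionFactory; the factory bootstrap helper and the topic name below are hypothetical placeholders, not the library's actual API.

```java
import javax.jms.ConnectionFactory;
import javax.jms.JMSContext;
import javax.jms.Queue;

public class FastJmsSketch {
  public static void main(String[] args) {
    // Placeholder: wire in the javax.jms.ConnectionFactory implementation
    // shipped with the Pulsar JMS client (class name and configuration keys
    // depend on the library version).
    ConnectionFactory factory = createPulsarConnectionFactory();

    // JMS 2.0 simplified API: a single JMSContext both produces and consumes
    try (JMSContext context = factory.createContext()) {
      Queue queue = context.createQueue("persistent://public/default/orders");
      context.createProducer().send(queue, "hello pulsar");
      String body = context.createConsumer(queue).receiveBody(String.class, 5_000);
      System.out.println("received: " + body);
    }
  }

  // Hypothetical helper standing in for library-specific bootstrap code.
  private static ConnectionFactory createPulsarConnectionFactory() {
    throw new UnsupportedOperationException("configure the Pulsar JMS ConnectionFactory here");
  }
}
```

Because the sketch is written only against the plain JMS 2.0 interfaces, the same code should run against any compliant broker; only the factory bootstrap changes.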

Why Lambda Architecture in Big Data Processing?

Due to the exponential growth of digitalization, the world is creating at least 2.5 quintillion (2.5 × 10^18) bytes of data every day, and that is what we denote as Big Data. Data is generated everywhere: social media sites, sensors, satellites, purchase transactions, mobile devices, GPS signals, and much more. With the advancement of technology, there is no sign of data generation slowing down; instead, it will keep growing in massive volumes. Major organizations, retailers, companies across verticals, and enterprise products have started leveraging big data technologies to produce actionable insights and drive business expansion and growth.

Overview

Lambda Architecture is an excellent design framework for processing huge volumes of data using both stream and batch processing methods. Stream processing analyzes data on the fly, while it is in motion and before it is persisted to storage, whereas batch processing is applied to data at rest, already persisted in stores such as databases and data warehouses. Lambda Architecture can be used effectively to balance latency, throughput, scalability, and fault tolerance, producing comprehensive and accurate views from batch and real-time stream processing simultaneously.
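
As a rough illustration of the idea (a toy in-memory sketch, not tied to any particular framework), the class below keeps a batch view recomputed from data at rest and a real-time view updated from the stream, and serves queries by merging the two; the counting use case and all names are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory sketch of the three Lambda Architecture layers.
public class LambdaArchitectureSketch {
  private final Map<String, Long> batchView = new ConcurrentHashMap<>();
  private final Map<String, Long> realtimeView = new ConcurrentHashMap<>();

  // Batch layer: periodically recompute the view from the full dataset at rest.
  public void recomputeBatchView(Iterable<String> eventsAtRest) {
    Map<String, Long> fresh = new ConcurrentHashMap<>();
    for (String key : eventsAtRest) {
      fresh.merge(key, 1L, Long::sum);
    }
    batchView.clear();
    batchView.putAll(fresh);
    realtimeView.clear(); // those events are now covered by the batch view
  }

  // Speed layer: update the incremental view as each event streams in.
  public void onStreamEvent(String key) {
    realtimeView.merge(key, 1L, Long::sum);
  }

  // Serving layer: answer queries by merging the batch and real-time views.
  public long countFor(String key) {
    return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
  }
}
```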

Throttling Made Easy: Back Pressure in Akka Streams

Big data has been the buzzword lately, but fast data is also gaining traction. If you work with data streaming, you know it can be tedious if not done right and may result in data leaks or OutOfMemory exceptions. If you are building a service or product today, users are willing to pay a premium to those who can deliver content with latency of just milliseconds.

Akka Streams

Akka Streams is a streaming module that is part of the Akka toolkit, designed to work with huge data streams and achieve concurrency in a non-blocking way by leveraging the Akka toolkit's power without defining actor behaviors and methods explicitly. It also provides an abstraction that hides what is going on under the hood and helps you focus on the logic the business needs.
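
As a small taste of how back pressure surfaces in the API, here is a minimal sketch using the Java DSL (assuming Akka 2.6+, where an ActorSystem can be passed directly as the materializer): a fast source is throttled to ten elements per second, and the upstream is automatically back-pressured instead of buffering unboundedly.

```java
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import java.time.Duration;

public class ThrottleExample {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("throttle-demo");

    Source.range(1, 1_000_000)                               // a fast upstream producer
        .throttle(10, Duration.ofSeconds(1))                 // at most 10 elements per second
        .runWith(Sink.foreach(System.out::println), system); // downstream demand drives the rate
  }
}
```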

Apache Flink With Kafka – Consumer and Producer

Overview

Apache Flink provides various connectors to integrate with other systems. In this article, I will share an example of consuming records from Kafka through FlinkKafkaConsumer and producing records to Kafka using FlinkKafkaProducer.

Setup

I installed Kafka locally and created two topics, TOPIC-IN and TOPIC-OUT.
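
A minimal version of that pipeline might look like the following sketch, assuming Flink's universal Kafka connector and a broker on localhost:9092; the upper-casing map is just a stand-in transformation between the consumer and the producer.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class KafkaInToKafkaOut {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092");
    props.setProperty("group.id", "flink-demo");

    // Consume string records from TOPIC-IN
    FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("TOPIC-IN", new SimpleStringSchema(), props);

    // Produce string records to TOPIC-OUT
    FlinkKafkaProducer<String> producer =
        new FlinkKafkaProducer<>("TOPIC-OUT", new SimpleStringSchema(), props);

    env.addSource(consumer)
        .map(value -> value.toUpperCase()) // stand-in transformation
        .addSink(producer);

    env.execute("TOPIC-IN to TOPIC-OUT");
  }
}
```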

Streaming ETL With Apache Flink – Part 1

Flink: as fast as squirrels

Introduction

After working on multiple projects involving batch ETL through polling data sources, I started working on streaming ETL. Streaming computation is necessary for use cases where real-time or near real-time analysis is required. For example, in IT Operations Analytics, it is paramount that Ops get critical alert information in real time, or within acceptable latency (near real time), to help them mitigate downtime or any errors caused by misconfiguration.

While there are many introductory articles on Flink (my personal favorites are the blogs from Ivan Mushketyk), not many go into the details of streaming ETL and the advanced aspects of the Flink framework that are useful in a production environment.

Uploading and Downloading Files: Streaming in Node.js

While the buffer APIs are easier to use to upload and download files, the streaming APIs are a great way to better manage memory and concurrency. In this post, you'll learn how to stream files between clients, Node.js, and Oracle Database.

Overview

The streaming APIs in Node.js are designed to leverage and simplify its evented nature. There are four different stream classes: Readable, Writable, Transform, and Duplex. Readable streams (which include Transform and Duplex) have a pipe method that can be thought of like the pipe command in Linux, where the standard output of one command can be piped into the standard input of another command. Here's a command line example that takes the output from the ls (list) command and pipes it through the grep (search) command to show files that have the word "oracle" in the name:
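
For example (the output depends on the contents of the current directory):

```
ls | grep oracle
```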

WebSockets vs. Long Polling

Sometimes we need information from our servers as soon as it’s available. The usual AJAX request/response we’re all used to doesn’t keep the connection open for this sort of use case. Instead, we need a push-based method like WebSockets, long polling, server-sent events (SSE), or the more recently introduced HTTP/2 push. In this article, we compare two methods: WebSockets and long polling.

An Overview of Long Polling

In 1995, Netscape Communications hired Brendan Eich to implement scripting capabilities in Netscape Navigator and, over a ten-day period, the JavaScript language was born. Its capabilities as a language were initially very limited compared to modern-day JavaScript, and its ability to interact with the browser’s document object model (DOM) was even more limited. JavaScript was mostly useful for providing limited enhancements to enrich document consumption, such as in-browser form validation and lightweight insertion of dynamic HTML into an existing document.
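
Before getting into the comparison, here is a rough client-side sketch of the long polling loop, written with Java 11's HttpClient; the events endpoint and timeout parameter are hypothetical, and a real client would also handle errors and back off between retries.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LongPollingClient {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    // Hypothetical endpoint that holds the request open until data is available
    URI events = URI.create("https://example.com/events?timeout=30");

    while (true) {
      HttpRequest request = HttpRequest.newBuilder(events).GET().build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() == 200) {
        System.out.println("event: " + response.body());
      }
      // Immediately re-issue the request so the server can deliver the next event
    }
  }
}
```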

Waking Up the World of Big Data

The term "Big Data" has lost its relevance. The fact remains, though: every dataset is becoming a big data set, whether its owners and users know (and understand) that or not. Big data isn't just something that happens to other people or giant companies like Google and Amazon. It's happening, right now, to companies like yours.

Recently, at Eureka!, our annual client conference, I presented on the evolution of Big Data technologies, including the different approaches that support the complex and vast amounts of data organizations are now dealing with. In this post, I'll break down some of my presentation and dig into the current state of Big Data, the trends driving its evolution, and one major shift that will deliver massive value for companies in the next wave of Big Data's growth.

How to Handle the Influx of Data

To learn about the current and future state of databases, we spoke with and received insights from 19 IT professionals. We asked, "How can companies get a handle on the vast amounts of data they’re collecting?" Here’s what they shared with us:

Ingest

  • It’s incredibly important to ingest, store, and present data for querying. We have a lambda architecture for in-memory processing, streaming, and analytics, and then very scalable data at rest for historical data. When people struggle, it’s usually because they’ve only figured out one piece of the puzzle. They may be able to ingest data quickly, but they are not able to analyze the data and get insights. It’s all about being able to capture the data and then do valuable things with it at the same time.
  • Have an Agile data architecture. We have perfected the collection of data with data ingestion solutions like Spark and Kinesis. But there are still a lot of challenges remaining in analyzing and operationalizing the data. There is not enough scale and investment going on in those two areas. Focus on concepts like federated query. Data can reside anywhere. Optimize compute to understand where the data lives so you can produce fast results. Data labs give people their own sandbox to work on data that exists and bring compute to where the data resides.
  • We handle data at a high level with governance based on where data is coming from, its structure, and where it’s going. With things like GDPR, this has become more important. Ingesting data streaming in real time is key. Stream-based ingestions with volume and noise are increasing. Bring in other technologies like Kafka to ingest. Multiple platforms offer “horses for courses.”

Query

  • The data management problem is solved with an overarching data management solution. Consider what data needs to be stored, for how long and at what granularity. For example, in banking, with mobile access, a lot of customers look at their balances when they are bored. Because we’re in-memory we can cache balance information so it’s cheap and easy for customers to get to.
  • Be able to securely store large amounts of data. Companies are using the cloud to do this because they do not have to pre-provision resources. They typically store this data in object stores like Amazon S3 or Google Cloud Storage. The second challenge is to derive value from these data sets; much of the value stays inaccessible because there is no way to query the raw data. A developer has to massage the data using various data pipelines before he/she can unlock the value of this data, and this transformation typically uses its own custom APIs. Databases make it easy to query these data sets. Databases associate a schema with the data, either at read time or write time, and make it accessible by a developer via a very standard query language like SQL. New-age databases can continuously ingest data from cloud services, like Amazon S3, Google Cloud or DynamoDB, and make it queryable via standard SQL. This makes it easier for a developer to extract value from large sets of data.
  • 1) Auditing is probably the first step. Understand what the data is, its origin, and its destination. Then marry this with the overall strategy of the business and figure out whether vital data exists, whether it should be archived, or whether it needs enrichment to produce meaningful data. 2) In a previous life, the first task was to run tools that would scan the network and find instances of running databases. In some cases, customers had several copies of the same data being processed by different systems, costing vast amounts in infrastructure and resources. No one was using this data. This goes back to designing databases with a purpose in mind. 3) Stream processing can play a huge part. By being able to validate, classify, and enrich data, you can add context and meaning. That way you can determine how much value it may have to you. Stream processing enables organization and context, which in turn enables understanding.
  • An active analytics platform enables clients to handle data and access streaming and historical data using SQL queries. We are now able to include graph relationship queries, and we also recognize the opportunity to run trained ML algorithms against the active analytics database.

Other

  • Delete it as fast as you possibly can. The types of customers that can and cannot delete data vary by industry; healthcare, aerospace, and finance must preserve data. Are you going to archive? Do you put it in a warehouse? Is the database transactional? How up to date does the data need to be: real time or near real time? Balance a transactional system at run time against the analytics customers want to run. RDBMS or ELK stack? A database is a tool; don’t abuse it. Have a strategy around the data and the long-term and short-term problems to address. Get it right early or it just gets more difficult.
  • Be more intelligent about how you will use the data to do novel things. Accelerate database releases to provide knowledge to the business more quickly. Be smart about equipping the right individuals to have control over their destiny. People are moving away from the monolith. Choose the right technology based on what you are trying to achieve. There are more tools today with greater specialization. Let teams chase after and test different solutions so they benefit from processing all of this data.
  • It’s a challenging task to get a handle on data collection, but it’s even more challenging to provide data access. Database technologies, such as data indexing, data normalizing, and data warehousing, allow companies to systematically store and retrieve data as efficiently and effectively as possible.
  • If you collect meaningful data that you expect to be able to sort, categorize, and report from, it should be stored in a database! And your database strategy will be key to your operational efficiency.
  • Databases are part of the solution. Choose a data storage product based on how you will get the data in and how you will query it. In terms of value, it comes down to how much you need to scale out to avoid performance hits. There’s vertical and horizontal scaling. Traditional databases scale well vertically; now horizontal scaling works well too. Cost is an issue. If you host in a public cloud, a lot of licensing headaches are removed because the cloud vendor has worked out the details. It’s much easier to adopt a database service because you don’t have to provision hardware.
  • Traceability, lineage, and governance are key. The graph model is able to represent open-ended, complex pictures using nodes and relationships. Keep track of not just metadata lineage but all the different identifiers for an individual, their devices, and their identities. We are seeing the rise of the chief data officer and of governance with GDPR and the California privacy initiative. It’s not unlike a data warehouse, where you get the data you need based on the requirements you have. See how pieces of data correlate across the entire enterprise. What kind of data pieces do you want to see correlated, and what kind of relationships do you want to discover?
  • Many companies need to have a better, more accurate understanding of how they expect their data to scale and what the projected growth rate is going to be. Granted, it can be hard to get perfectly right. (You might start with a small environment, get customers faster than anticipated, and blow out projections.) But the importance of having a way to understand, from the beginning, what data you are collecting and what the volume is going to be cannot be overstressed. With Apache Cassandra, for example, it’s fairly easy to scale, but it’s not particularly fast to do so. You need to plan deployment with enough runway; if you hit limits, you’re going to have problems.
  • Although we are built to handle and scale high volumes of data, one of the first steps is always to get a clear picture of which data points are really important. The value of data is also changing with the location (e.g. cloud vs. edge) and over time. Exploring and learning from (and with) the data is an important part of ongoing success.
  • Use data platforms that allow you to work naturally with data of any shape or structure in its native form without having to constantly wrangle with a rigid schema. You also need the ability to scale out on commodity infrastructure, across geographic regions, to accommodate massive increases in data volume.
  • That’s why we developed a platform to handle the scale and diversity of data. Edge to cloud is a common use case, with initial processing at the edge and then moving the data to data centers. Once the data is in a central location, that’s where you can do ML, come up with models, and push insight back to the edge. When you have datasets like that, that’s where the database and streaming fit in; with fast streaming and fast processing, you need a platform with different data services to meet all of your needs.


The Internet of Things: Connecting Devices and Data

Whether you recognize it or not, the Internet of Things is becoming more pervasive, and it will only become more useful as more data is collected and analyzed and connected devices become more resilient and accurate. The 2019 Guide to the Internet of Things will explore how you can get started building your own connected devices and products, and how exactly to utilize all of your data to the best of your ability.

Big Data: Volume, Variety, and Velocity

Big data is the new competitive advantage, and it is now a necessity for businesses. With the growing proliferation of data sources such as smart devices, vehicles, and applications, the need to process this data in real time and deliver relevant insights is more urgent than ever. The 2019 Guide to Big Data will explore tools and ecosystems for analyzing big data and relevant use cases ranging from sustainability science to autonomous vehicles.

How to Use Redis Streams in Your Apps

Data processing has been revolutionized in recent years, and these changes present tremendous possibilities. For example, if we consider a variety of use cases — from IoT and Artificial Intelligence to user activity monitoring, fraud detection and FinTech — what do all of these cases have in common? They all collect and process high volumes of data, which arrive at high velocities. After processing this data, these technologies then deliver it to all the appropriate consumers of data.

With the release of version 5.0, Redis launched an innovative new way to manage streams while collecting high volumes of data — Redis Streams. Redis Streams is a data structure that, among other functions, can effectively manage data consumption, persist data when consumers are offline with a data fail-safe, and create a data channel between many producers and consumers. It allows users to scale the number of consumers using an app, enables asynchronous communications between producers and consumers and efficiently uses main memory. Ultimately, Redis Streams is designed to meet consumers' diverse needs, from real-time data processing to historical data access, while remaining easy to manage.
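
As a quick sketch of what that looks like from an application (assuming a local Redis 5.0+ server and the Jedis 3.x Java client; the stream name and fields are invented for the example), a producer appends entries with XADD and a consumer can read them back in ID order:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.StreamEntry;
import redis.clients.jedis.StreamEntryID;

public class RedisStreamsSketch {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      // Producer: append an entry; NEW_ENTRY ("*") lets Redis assign the ID
      Map<String, String> fields = new HashMap<>();
      fields.put("device", "thermostat-1");
      fields.put("temp", "21.5");
      jedis.xadd("sensor-events", StreamEntryID.NEW_ENTRY, fields);

      // Consumer: read the stream back from oldest to newest
      // (null bounds are treated as the "-" and "+" range markers)
      List<StreamEntry> entries =
          jedis.xrange("sensor-events", (StreamEntryID) null, (StreamEntryID) null, 100);
      for (StreamEntry entry : entries) {
        System.out.println(entry.getID() + " -> " + entry.getFields());
      }
    }
  }
}
```

Consumer groups (the XGROUP and XREADGROUP commands) add the offline-consumer and fan-out behavior described above, but even this minimal producer/reader pair shows the append-only, ID-ordered nature of a Redis stream.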