Introducing Cloudera SQL Stream Builder (SSB)

[Screenshot: the Cloudera SQL Stream Builder console]

The initial release of Cloudera SQL Stream Builder, part of the CSA 1.3.0 release of Apache Flink and friends from Cloudera, delivers an environment that is well integrated into Cloudera Data Platform. SSB is an improved release of Eventador's SQL Stream Builder with integration into Cloudera Manager, Cloudera Flink, and other streaming tools.

Big-Data Project Guidelines

The aim of this article is to share some of the most relevant guidelines for cloud-based big data projects, drawn from my recent engagements. The list is not exhaustive and can be extended to fit each organization's or project's specific requirements.

Guidelines for Cloud-Based Big Data Projects

Data Storage

  • Use data partitioning and/or bucketing to reduce the amount of data to process.
  • Use Parquet format to reduce the amount of data to store.
  • Prefer SNAPPY compression for frequently consumed data, and GZIP compression for data that is rarely consumed (see the write sketch after this list).
  • Try as much as possible to store fewer, larger files (roughly 256 MB to 1 GB each) instead of many small files, to improve read/write performance and reduce costs; the ideal size depends on the use case and the underlying block storage file system.
  • Consider a table format such as Delta Lake or Apache Iceberg before building custom solutions for schema evolution and data updates.
  • “Design by query” can help improve consumption performance; for instance, you can store the same data in different layouts and at different levels of aggregation depending on the consumption pattern.
  • Secure data stored on S3 with an appropriate model that uses versioning and archiving.
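
As a minimal illustration of the partitioning, Parquet, and compression points above, the following PySpark sketch writes a partitioned, Snappy-compressed Parquet dataset. The bucket, paths, and column names are placeholder assumptions, not taken from any specific project.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-guidelines").getOrCreate()

    # Hypothetical raw events dataset; path and column names are placeholders.
    events = spark.read.json("s3a://example-bucket/raw/events/")

    (events
        .repartition("event_date")            # fewer, larger files per partition value
        .write
        .mode("overwrite")
        .partitionBy("event_date")            # enables partition pruning at read time
        .option("compression", "snappy")      # fast decompression for frequently read data
        .parquet("s3a://example-bucket/curated/events/"))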

Data Processing

  • When developing distributed applications, rethink your code to avoid data shuffling as much as possible, as shuffling degrades performance.
  • Broadcasting small tables can help achieve better join performance (see the sketch after this list).
  • Once again, use the Parquet format to reduce the amount of data to process, thanks to PredicatePushDown and ProjectionPushDown.
  • When consuming data, use data-native protocols as much as possible to stay close to the data and avoid unnecessary calls and protocol overhead.
  • Before choosing a computation framework, first identify whether your problem needs parallelization or distribution.
  • Consider merging and compacting your files to improve read performance and reduce cost (Delta.io can help achieve that).
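
To make the shuffling, broadcasting, and pushdown points concrete, here is a rough PySpark sketch; the table locations, columns, and join key are assumptions for illustration only.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("processing-guidelines").getOrCreate()

    # Large fact table in Parquet: the column selection and filter below can be pushed
    # down to the file scan (ProjectionPushDown / PredicatePushDown), reducing the data read.
    orders = (spark.read.parquet("s3a://example-bucket/curated/orders/")
              .select("order_id", "country_code", "amount")
              .filter(F.col("amount") > 100))

    # Small dimension table: broadcasting it avoids shuffling the large side of the join.
    countries = spark.read.parquet("s3a://example-bucket/curated/countries/")

    enriched = orders.join(F.broadcast(countries), on="country_code", how="left")
    enriched.write.mode("overwrite").parquet("s3a://example-bucket/marts/orders_enriched/")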

Data Locality

  • Move the processing to the data, not the other way around: data is generally much larger than the jobs or scripts that process it.
  • Process data in the cloud and extract only the most relevant and necessary results (see the sketch after this list).
  • Limit inter-region transfers.
  • Avoid as much data travel as possible between infrastructures, proxies, and patterns.
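
A small sketch of the "process in the cloud, pull only the result out" idea, assuming a Spark cluster running next to the data; paths and column names are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("locality-guidelines").getOrCreate()

    # The heavy scan and aggregation run in the cluster, next to the data...
    daily_totals = (spark.read.parquet("s3a://example-bucket/curated/events/")
                    .groupBy("event_date")
                    .agg(F.count("*").alias("events"), F.sum("amount").alias("amount")))

    # ...and only the small aggregated result leaves the cloud environment.
    local_df = daily_totals.toPandas()
    print(local_df.head())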

Various

  • Use a data lake for analytical & exploratory use cases and use operational databases for operational ones.
  • To ingest data, prefer config-based solutions such as DMS or Debezium over custom ones, and prefer CDC solutions for long-running ingestion (see the connector sketch after this list).
  • Keep structures (tables, prefixes, paths…) that are not meant to be shared private.
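
As an example of a config-based CDC ingest, here is a hedged sketch of registering a Debezium MySQL connector through the Kafka Connect REST API. Hostnames, credentials, tables, and topics are placeholders, and the exact property names vary between Debezium versions, so check the documentation for your release.

    import json
    import requests

    # Hypothetical Debezium MySQL connector definition; all values are placeholders.
    connector = {
        "name": "inventory-cdc",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.example.internal",
            "database.port": "3306",
            "database.user": "cdc_user",
            "database.password": "changeme",
            "database.server.id": "184054",
            "database.server.name": "inventory",
            "table.include.list": "inventory.orders,inventory.customers",
            "database.history.kafka.bootstrap.servers": "kafka.example.internal:9092",
            "database.history.kafka.topic": "schema-changes.inventory",
        },
    }

    # Register the connector with a Kafka Connect worker (placeholder URL).
    resp = requests.post(
        "http://connect.example.internal:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()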

Migrating Apache Flume Flows to Apache NiFi: Kafka Source to Multiple Sinks

The world of streaming is constantly moving... yes, I said it. Every few years some projects get favored by the community and by developers. Apache NiFi has stepped ahead and has become the go-to for quickly ingesting data from sources and delivering it to sinks with routing, aggregation, basic ETL/ELT, and security. I am recommending a migration from legacy Flume to Apache NiFi. The time is now.

Below, I walk you through a common use case. It's easy to integrate Kafka as a source or sink with Apache NiFi or MiNiFi agents. We can also add HDFS or Kudu sinks. All of this comes with full security, SSO, governance, cloud and Kubernetes (K8s) support, schema support, full data lineage, and an easy-to-use UI. Don't get fluming mad, let's try another great Apache project.

Using Cloudera Data Science Workbench With Apache NiFi

Using Deployed Models as a Function as a Service

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP, but this will work with any CDSW installation regardless of install type.
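
From NiFi, the call is typically made with an InvokeHTTP processor; as a plain-Python illustration of the request shape, here is a minimal sketch. The endpoint URL and access key are placeholders for whatever CDSW generates for your deployed model.

    import requests

    # Placeholder endpoint and access key; CDSW shows the real values on the model's page.
    MODEL_URL = "https://modelservice.cdsw.example.com/model"
    ACCESS_KEY = "your-model-access-key"

    payload = {
        "accessKey": ACCESS_KEY,
        "request": {"sentence": "Apache NiFi makes streaming ingest pleasant."},
    }

    resp = requests.post(MODEL_URL, json=payload)
    resp.raise_for_status()
    print(resp.json())  # the model's output is returned under the "response" key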

In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
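
The deployed model function itself might look roughly like the sketch below, assuming TextBlob is installed in the CDSW project; the function name and argument key are illustrative, not the exact code from the original post.

    from textblob import TextBlob

    def predict(args):
        """Score the sentiment of a passed sentence.

        `args` is the request body, e.g. {"sentence": "..."}.
        """
        blob = TextBlob(args.get("sentence", ""))
        return {
            "polarity": blob.sentiment.polarity,          # -1.0 (negative) to 1.0 (positive)
            "subjectivity": blob.sentiment.subjectivity,  # 0.0 (objective) to 1.0 (subjective)
        }

    # Quick local test
    print(predict({"sentence": "Apache NiFi makes streaming ingest pleasant."}))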