The Complete Apache Spark Collection [Tutorials and Articles]

In this edition of "Best of DZone," we've compiled our best tutorials and articles on one of the most popular analytics engines for data processing, Apache Spark. Whether you're a beginner or a long-time user who has run into the inevitable bottlenecks, we've got your back!

Before we begin, we'd like to thank those who were a part of this article. DZone has been and continues to be a community powered by contributors like you who are passionate about sharing what they know with the rest of the world.

Migrating Apache Flume Flows to Apache NiFi: Kafka Source to Multiple Sinks

The world of streaming is constantly moving... yes, I said it. Every few years, certain projects win the favor of the community and its developers. Apache NiFi has stepped ahead to become the go-to for quickly ingesting data from sources and delivering it to sinks, with routing, aggregation, basic ETL/ELT, and security. I am recommending a migration from legacy Flume to Apache NiFi. The time is now.

Below, I walk you through a common use case. It's easy to integrate Kafka as a source or sink with Apache NiFi or MiNiFi agents, and we can add HDFS or Kudu sinks as well. All of this comes with full security, SSO, governance, cloud and K8s support, schema support, full data lineage, and an easy-to-use UI. Don't get fluming mad; let's try another great Apache project.

Wake Up From the Big Data Nightmare


If you don’t actually work with Big Data, and you only know about it from what you hear in the media — how it can be used to optimize traffic flows, make financial trade decisions, foil terrorist plots, make devices smarter and self-operating, and even track athletic performance — you’ll probably say it’s a dream come true.  

However, for those who actually extract, analyze, and manage Big Data so it can do all those wondrous things, it’s often nothing but a nightmare.

Running Alluxio-Presto Sandbox in Docker

The Alluxio-Presto sandbox is a Docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

In this guide, we’ll be using Presto and Alluxio to showcase how Alluxio can improve Presto’s query performance by caching our data locally so that it can be accessed at memory speed!
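To make that concrete, here is a minimal sketch of querying the sandbox's Presto from Python using PyHive (pip install 'pyhive[presto]'). The coordinator address and the orders table are illustrative assumptions, not details taken from the sandbox itself:

```python
# A minimal sketch, assuming the sandbox's Presto coordinator listens on
# localhost:8080 and a Hive table named default.orders exists -- both are
# illustrative assumptions.
from pyhive import presto

conn = presto.connect(
    host="localhost",   # Presto coordinator inside the sandbox
    port=8080,
    catalog="hive",     # tables registered in the Hive metastore
    schema="default",
)

cursor = conn.cursor()
# If the table's Hive location uses the alluxio:// scheme, repeated runs of
# the same query can be served from Alluxio's cache instead of the backing store.
cursor.execute("SELECT COUNT(*) FROM orders")
print(cursor.fetchone())
```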

The Practice of Alluxio in Ctrip Real-Time Computing Platform

Today, a real-time computation platform is becoming increasingly important in many organizations. In this article, we will describe how ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and maintain job consistency during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to reduce the workload pressure on our HDFS NameNode.
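As a rough illustration of that caching-layer idea (not Ctrip's actual code), a Spark SQL job can read through Alluxio simply by using an alluxio:// path instead of hdfs://. The master hostname, port, and path below are assumptions, and the Alluxio client jar must be on Spark's classpath:

```python
# A minimal PySpark sketch of the caching-layer idea: hot data is served from
# Alluxio's memory tier, and the HDFS NameNode sees far fewer requests.
# The master hostname and path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-cache-demo").getOrCreate()

# Reading through Alluxio (19998 is the default master RPC port); the first
# read populates the cache, and later reads are served at memory speed.
df = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events")
df.createOrReplaceTempView("events")

spark.sql("SELECT COUNT(*) FROM events").show()
```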

Background and Architecture

Ctrip.com is the largest online travel booking website in China, providing online travel services including hotel reservations, transportation ticketing, and packaged tours, with hundreds of millions of online visits every day. Driven by this high demand, a massive amount of data is stored on big data platforms in different formats. Handling nearly 300,000 offline and real-time analytics jobs every day, our main Hadoop cluster operates at the scale of a thousand servers, with more than 50PB of data stored and growing by 400TB daily.

Hadoop vs. Snowflake

A few years ago, Hadoop was touted as the replacement for the data warehouse, which is clearly nonsense. This article is intended to provide an objective summary of the features and drawbacks of Hadoop/HDFS as an analytics platform and to compare these to the cloud-based Snowflake data warehouse.

Hadoop: A Distributed, File-Based Architecture

First developed by Doug Cutting at Yahoo! and subsequently open-sourced under the Apache Software Foundation, Hadoop gained considerable traction as a possible replacement for analytic workloads (data warehouse applications) running on expensive MPP appliances.

Software Ate the World and Now the Models Are Running It

Along with our data ecosystem partners, we are seeing unprecedented demand for solutions to complex, business-critical challenges in dealing with data.

Consider this: data engineers walk into work every day knowing they're fighting an uphill battle. The root of the problem (or at least one problem) is that modern data systems are becoming impossibly complex. The amount of data being processed in organizations today is staggering, with annual data growth often measured in high double-digit percentages. Just a year ago, Forbes reported that 90% of the world's data had been created in the previous two years.

Big Data and Hadoop: An Introduction

A very common misconception is that big data is a specific technology or tool. Big data, in reality, is a very large, heterogeneous set of data. This data mostly comes in unstructured or semi-structured form, so extracting useful information from it is difficult. With the growth of cloud technologies, the rate at which data is generated has increased tremendously.

Therefore, we need a solution that allows us to process such "Big Data" at optimal speed without compromising data security. A cluster of technologies deals with this challenge, and one of the best is Hadoop.
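To make the processing model concrete, here is a toy word count in plain Python that mimics the map, shuffle, and reduce phases Hadoop runs in parallel across a cluster; the input lines are made up for illustration:

```python
# A toy illustration of the MapReduce model that Hadoop popularized, run
# locally in plain Python. On a real cluster, the map and reduce phases run
# in parallel across many machines over data stored in HDFS.
from itertools import groupby
from operator import itemgetter

lines = ["big data is not a tool", "big data is a very large set of data"]

# Map: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group pairs by key (the framework does this between phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each word
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)
```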

Thinking in MapReduce, but With SQL

For those considering Citus: if your use case seems like a good fit, we're often willing to spend some time with you to help you get an understanding of the Citus database and what kind of performance it can deliver. We commonly do this in a roughly two-hour pairing session with one of our engineers. We'll talk through the schema, load up some data, and run some queries. If we have time at the end, it's always fun to load the same data and queries into single-node Postgres and see how we compare. After seeing this for years, I still enjoy seeing performance speedups of 10 to 20x over a single-node database, and in some cases as high as 100x.

And the best part is that it doesn't take heavy re-architecting of data pipelines. All it takes is some data modeling and parallelization with Citus.
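As a hedged sketch of what that data modeling step can look like (the connection string and the events table are illustrative, not from the article), distributing a table in Citus is a single SQL call:

```python
# A minimal sketch using psycopg2 (pip install psycopg2-binary) against a
# Citus coordinator. The DSN and the events table are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id  bigint,
        event_at timestamptz,
        payload  jsonb
    )
""")

# Shard the table across worker nodes by user_id; Citus then fans queries
# out to the shards and runs them in parallel.
cur.execute("SELECT create_distributed_table('events', 'user_id')")

# An aggregate like this is executed as many per-shard queries in parallel,
# which is where the large speedups over single-node Postgres come from.
cur.execute("SELECT user_id, count(*) FROM events GROUP BY user_id LIMIT 10")
print(cur.fetchall())
```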

Database Operations on Cassandra and Oracle Using Apache Spark

In this article, I will perform operations that write to and read from a Cassandra database using Spark. I hope it will be a useful article in terms of awareness.

The rapid growth of data sources and data volumes has made collected data difficult to process, even as the need to process it has increased. In response to these needs and challenges, various solutions have been produced for the rapid analysis and storage of big data. Spark is one of the most common solutions used to process big data. Cassandra is one of the most widely used databases for storing and querying big data. Now, we will try to use these two technologies together.
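Here is a minimal PySpark sketch of the kind of write and read involved, using the DataStax spark-cassandra-connector. The contact point, keyspace, and table names are assumptions for illustration, and the connector package must be supplied to Spark (e.g. via --packages):

```python
# A minimal sketch, assuming a local Cassandra node and an existing
# demo.users table; the spark-cassandra-connector jar must be on the
# classpath (e.g. --packages com.datastax.spark:spark-cassandra-connector_2.12:3.x).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-cassandra-demo")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Write: append a small DataFrame into the keyspace/table
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(table="users", keyspace="demo")
   .mode("append")
   .save())

# Read the table back as a DataFrame and query it
users = (spark.read
         .format("org.apache.spark.sql.cassandra")
         .options(table="users", keyspace="demo")
         .load())
users.filter("id = 1").show()
```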