Best Practices for Apache Ranger-Based Authorization for Your Data Platform

Introduction

Alluxio enables data orchestration for compute in any cloud. It unifies data silos on-premises and across cloud environments to provide the data locality, accessibility, and elasticity needed to reduce the complexity of orchestrating data for today’s big data and AI/ML workloads.

Alluxio is designed to help any framework access any data from any storage at high performance, regardless of the environment, which enables an organization to remain agile and competitive in adopting and experimenting with new and existing technologies.

Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

Running Presto with Alluxio is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSDs or memory to cache hot datasets close to Presto workers. Presto supports hash-based soft affinity scheduling to enforce that only one or two copies of the same data are cached in the entire cluster, which improves cache efficiency by allowing more hot data to be cached locally. The current hashing algorithm, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.

Soft Affinity Scheduling

Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (its preferred node). The mapping from a split to a Presto worker is computed by a hashing function on the split, ensuring that the same split is always hashed to the same worker. The first time a split is processed, data is cached on the preferred worker node. When subsequent queries process the same split, those requests are scheduled to the same worker node again. Since the data is already cached locally, no remote read is necessary.
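
To make the idea concrete, here is a minimal, illustrative sketch of a consistent-hashing ring that maps splits to preferred workers. The class and names are hypothetical and not Presto's actual implementation; the point is that when the cluster size changes, only the splits owned by the departing or arriving worker are remapped, unlike plain modulo hashing where most splits move.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable hash so a given split always maps to the same point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Illustrative consistent-hashing ring for soft affinity scheduling."""

    def __init__(self, workers, virtual_nodes=100):
        # Each worker is placed on the ring multiple times (virtual nodes)
        # so load stays balanced when workers join or leave.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def preferred_worker(self, split_id: str) -> str:
        # Walk clockwise from the split's hash to the first worker on the ring.
        idx = bisect.bisect(self._keys, _hash(split_id)) % len(self._ring)
        return self._ring[idx][1]


# Example: with consistent hashing, removing "worker-2" only remaps the splits
# that worker-2 owned; all other splits keep their preferred node and cache.
ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.preferred_worker("s3://bucket/table/part-00042"))
```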

How To Build a Self-Serve Data Architecture for Presto Across Clouds

This article highlights the synergy between two widely adopted open-source projects, Alluxio and Presto, and demonstrates how together they deliver a self-serve data architecture across clouds.

What Makes an Architecture Self-Serve?

Condition 1: Evolution of the Data Platform Does Not Require Changes

All data platforms evolve over time, whether through the addition of a new data store, a new compute engine, or a new team that needs to access shared data. In any of these cases, a data platform is self-serve if it does not require changes to accommodate the evolution.

Efficient Model Training in the Cloud with Kubernetes, TensorFlow, and Alluxio

Alibaba Cloud Container Service Team Case Study

This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed, with detailed optimizations of each component in the architecture. Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in a more than 40% reduction in training time and cost.

1. New Trends in AI: Kubernetes-Based Deep Learning in The Cloud

Background

Artificial neural networks are trained with increasingly massive amounts of data, driving innovative solutions to improve data processing. Distributed Deep Learning (DL) model training can take advantage of multiple technologies, such as:

Building a High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go

T3Go is China’s first platform for smart travel based on the Internet of Vehicles. In this article, Trevor Zhang and Vino Yang from T3Go describe the evolution of their data lake architecture, built on cloud-native or open-source technologies including Alibaba OSS, Apache Hudi, and Alluxio. Today, their data lake stores petabytes of data, supporting hundreds of pipelines and tens of thousands of tasks daily. It is essential for business units at T3Go including Data Warehouse, Internet of Vehicles, Order Dispatching, Machine Learning, and self-service query analysis.

In this blog, you will see how we slashed data ingestion time by half using Hudi and Alluxio. Furthermore, data analysts using Presto, Hudi, and Alluxio saw queries speed up by 10 times. We built our data lake on data orchestration for multiple stages of our data pipeline, including ingestion and analytics.

Reducing Large S3 API Costs Using Alluxio

I. Introduction

Previous Works

There have been numerous articles and online webinars covering the benefits of using Alluxio as an intermediate storage layer between S3 data storage and the data processing system used for ingestion or retrieval of data (e.g., Spark or Presto), as depicted in the picture below:

To name a few use cases:

Introducing Wormhole: Fast Dockerized Presto and Alluxio Setups

Just like a real wormhole, this tool is all about speed.

This blog introduces Wormhole, an open-source Dockerized solution for deploying Presto and Alluxio clusters for blazing-fast analytics on file systems (we use S3, GCS, and OSS). When it comes to analytics, people are generally hands-on in writing SQL queries and love to analyze data that resides in a warehouse (e.g., a MySQL database). But as data grows, these stores start to fall short, and there arises a need to get results in the same or a shorter time frame. This can be solved by distributed computing, and Presto is designed for exactly that. When paired with Alluxio, it works even faster. That’s what Wormhole is all about.

Here is the high-level architecture diagram of the solution:

Getting Started With EMR Hive on Alluxio in 10 Minutes

Find out what the buzz is behind working with Hive and Alluxio.

This tutorial describes the steps to set up an EMR cluster with Alluxio as a distributed caching layer for Hive, and to run sample queries that access data in S3 through Alluxio.

Prerequisites

  • Install the AWS command line tool on your local laptop. If you are running Linux or macOS, it is as simple as running pip install awscli.
  • Create an EC2 key pair from the EC2 console if you don’t have an existing one.

Step 1: Create an EMR Cluster

First, let's create an EMR cluster with Hive as its built-in application and Alluxio as an additional application installed through bootstrap scripts. The following command submits a request to create such a cluster with one master and two worker instances running on EC2. Remember to replace “alluxio-aws-east” in the following command with your AWS keypair name, and “m4.xlarge” with the EC2 instance type you would like to use. Check out this page for more details on the bootstrap script.
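
The exact CLI command is not reproduced here, but as a rough sketch of the same request, an equivalent cluster could be created with boto3. The key name, instance type, region, EMR release label, and bootstrap script path below are placeholder assumptions, not the tutorial's exact values.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Sketch of the cluster described above: one master, two worker nodes,
# Hive as a built-in application, and Alluxio installed via a bootstrap script.
# Key name, instance type, release label, and script path are placeholders.
response = emr.run_job_flow(
    Name="emr-hive-alluxio",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,  # 1 master + 2 workers
        "Ec2KeyName": "alluxio-aws-east",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-alluxio",
            "ScriptBootstrapAction": {"Path": "s3://<your-bucket>/alluxio-emr.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```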

Making a Secure Plug-and-play Distributed File System Service Using Alluxio in Baidu

This article describes how Baidu creates a secure, modular and extensible distributed file system service in project Pingo – a big data analytics solution for enterprises – based on the open-source project Alluxio.

In this article, you will learn how to incorporate Alluxio to implement a unified distributed file system service as well as how to add extensions on top of Alluxio including customized authentication schemes and UDF (user-defined functions) on Alluxio files.

Big Data Tutorial: Running Alluxio On HashiCorp Nomad

Get Nomad working for you

I recently worked on a PoC evaluating Nomad for a client. Since certain constraints limited what was possible in the client environment, I put together something “quick” on my personal workstation to see what was required for Alluxio to play nice with Nomad.

Getting up and running with Nomad is fairly quick and easy; download the compressed binary, extract it, and start the Nomad agent in dev mode. Done! Getting Alluxio to run on Nomad turned out to be a little more involved than I thought. One major issue I ran into quite early on in the exercise was that Nomad doesn’t yet support persistent storage natively (expected in the next release).

Running Alluxio-Presto Sandbox in Docker

The Alluxio-Presto sandbox is a Docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

In this guide, we’ll be using Presto and Alluxio to showcase how Alluxio can improve Presto’s query performance by caching our data locally so that it can be accessed at memory speed!

Four Different Ways to Write in Alluxio

Alluxio is an open-source data orchestration system for analytics and AI workloads. Distributed applications like Apache Spark or Apache Hive can access Alluxio through its HDFS-compatible interface without code changes. We refer to external storage such as HDFS or S3 as under storage. Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options to read, write, and manage files. This article describes the different ways to write files to Alluxio and the tradeoffs in performance, consistency, and fault tolerance compared to HDFS.

Consider an application, such as a Spark job, that saves its output to an external storage service. Writing the job output to the memory layer of a colocated Alluxio worker achieves the best write performance. Due to the volatility of memory, when a node in Alluxio goes down or restarts, any data in that node’s memory is lost. To prevent data loss, Alluxio provides the ability to write the data to the persistent under storage either synchronously or asynchronously by configuring client-side Write Types. Each Write Type has benefits and drawbacks associated with it. Applications that write to Alluxio storage should consider the different write types and perform a cost-benefit analysis to determine the write type best suited to the application's requirements.
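
As an illustration of choosing a write type from a Spark job, the hedged sketch below sets Alluxio's client-side write type on the job's Hadoop configuration before writing to an alluxio:// path. The Alluxio master address and output path are placeholders, and the snippet assumes the Alluxio client jar is on Spark's classpath; ASYNC_THROUGH is shown with the other write types noted in the comments.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-write-type-demo").getOrCreate()

# Client-side write type: ASYNC_THROUGH writes to Alluxio memory first and
# persists to the under storage asynchronously. Other options include
# MUST_CACHE (memory only), CACHE_THROUGH (synchronous persistence),
# and THROUGH (under storage only).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("alluxio.user.file.writetype.default", "ASYNC_THROUGH")

df = spark.range(1000).toDF("id")

# The Alluxio master address and output path below are placeholders.
df.write.parquet("alluxio://alluxio-master:19998/output/demo")
```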

Creating Grafana Dashboards to Visualize Alluxio Metrics

Overview

Monitoring metrics is essential for operating distributed systems in production. Alluxio collects metrics using the Codahale Metrics Library on I/O throughput, RPC throughput, and resource usage. Alluxio metrics are shown in its web UI but are also available through a REST endpoint or exportable to several third-party sinks in a time-series manner (see docs).
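
For example, a quick way to sample those metrics programmatically is to pull the JSON that the web server exposes. The host, port, and route below are assumptions based on the standard Codahale metrics servlet layout; verify them against the metrics documentation for your Alluxio version.

```python
import json
from urllib.request import urlopen

# Placeholder host/port for the Alluxio master web UI; the metrics route may
# differ by Alluxio version, so check the metrics documentation for yours.
METRICS_URL = "http://alluxio-master:19999/metrics/json/"

with urlopen(METRICS_URL) as resp:
    metrics = json.load(resp)

# Print a few gauge values (e.g., capacity and bytes read), if present.
for name, gauge in sorted(metrics.get("gauges", {}).items())[:10]:
    print(name, gauge.get("value"))
```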

Grafana, a comprehensive metrics visualization tool, ties into this process by pulling the metrics that systems like Alluxio collect through a sink and visualizing them in a more helpful fashion. This guide covers how to set up Grafana and Graphite, a supported sink for Alluxio that stores metrics in a time-series database, and explores some of the possibilities the combination offers.

The Practice of Alluxio in Ctrip Real-Time Computing Platform

Today, real-time computation platforms are becoming increasingly important in many organizations. In this article, we describe how Ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and maintain job consistency during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to reduce the workload pressure on our HDFS NameNode.

Background and Architecture

Ctrip.com is the largest online travel booking website in China. It provides online travel services including hotel reservations, transportation ticketing, and packaged tours, with hundreds of millions of online visits every day. Driven by this high demand, a massive amount of data is stored in big data platforms in different formats. Handling nearly 300,000 offline and real-time analytics jobs every day, our main Hadoop cluster is at the scale of a thousand servers, with more than 50PB of data stored and growing by 400TB daily.

Turn Cloud Storage or HDFS Into Your Local File System for Faster AI Model Training With TensorFlow

Users today have a variety of cost-effective and scalable storage options for their Big Data or machine learning applications, from distributed storage systems like HDFS and Ceph to cloud storage like AWS S3, Azure Blob Storage, and Google Cloud Storage. Each of these storage technologies has its own API. This means that developers need to constantly learn new storage APIs and develop their code against them. In some cases, for example in machine learning/deep learning workloads, the frameworks don’t have integrations with all the needed storage-level APIs, and a lot of data engineering needs to be done to move the data around. It has become common practice to move data sets from the HDFS data lake to the local compute instances of the data scientist to achieve data locality and access data via the local file system.

This article presents a different approach that connects distributed file systems like HDFS or cloud storage systems to data processing frameworks by making them look like a local file system: the Alluxio POSIX API. To explain the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.
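
To illustrate the idea, the sketch below assumes Alluxio has already been mounted locally through its POSIX (FUSE) API, so TensorFlow can read training data through ordinary file paths. The mount point and file pattern are placeholders, not values from the original article.

```python
import tensorflow as tf

# Once the Alluxio POSIX (FUSE) mount is in place, data in S3 or HDFS appears
# under a local path. The mount point and file pattern here are placeholders.
DATA_DIR = "/mnt/alluxio-fuse/training-data"

files = tf.data.Dataset.list_files(DATA_DIR + "/*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# The framework sees a local file system, so no storage-specific connector
# or API is needed in the input pipeline.
for batch in dataset.take(1):
    print(batch.shape)
```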

Distributed Data Querying With Alluxio

This blog is about how I used Alluxio to reduce p99 and p50 query latencies and optimize the overall platform costs for a distributed querying application. I walk through the product and architecture decisions that led to our final architecture, discuss the tradeoffs, share some statistics on the improvements, and discuss future improvements to the system.

Description

A wireframe of a dashboard with drag-and-drop functionality.


Moving From Apache Thrift to gRPC: A Perspective From Alluxio

As part of the Alluxio 2.0 release, we have moved our RPC framework from Apache Thrift to gRPC. In this article, we will talk about the reasons behind this change as well as some lessons we learned along the way.

Alluxio is an open-source distributed virtual file system, acting as the data access layer that enables big data and ML applications to process data from multiple heterogeneous storage systems with locality and many other benefits. Alluxio is built with a master/worker architecture where masters handle metadata operations and workers handle requests to read and write data. In Alluxio 1.x, the RPC communication between clients and servers was built mostly on top of Apache Thrift. Thrift enabled us to define the Alluxio service interfaces in simple IDL files and implement client bindings using the native Java interfaces generated by the Thrift compiler. However, we faced several challenges as we continued developing new features and improvements for Alluxio.

Build a New Computing Platform Based on Open Source to Serve Mobile Users in China

China Unicom is one of the five largest telecom operators in the world. China Unicom’s booming business in 4G and 5G networks has to serve an exploding base of hundreds of millions of smartphone users. This unprecedented growth brought enormous challenges and new requirements to the data processing infrastructure. The previous generation of its data processing system was based on IBM midrange computers, Oracle databases, and EMC storage devices. This architecture could not scale to process the amounts of data generated by the rapidly expanding number of mobile users. Even after deploying Hadoop and the Greenplum database, it was still difficult to cover critical business scenarios with their varying massive data processing requirements. The complicated architecture of the incumbent computing platform created many new challenges to using resources effectively.

Fortunately, there is a new generation of distributed computation frameworks that can help China Unicom meet the enormous data challenges of a variety of business scenarios. To solve these problems, the company built a new software stack of Apache Spark, Alluxio, HDFS, Hive, and Apache Kafka. We leverage Alluxio as the core component for a unified, memory-centric distributed data processing platform with consolidated resources and improved computation efficiency.