Why a Cloud-Native Database Must Run on K8s

We’ve been talking about migrating workloads to the cloud for a long time, but a look at the application portfolios of many IT organizations demonstrates that there’s still a lot of work to be done. In many cases, challenges with persisting and moving data in clouds continue to be the key limiting factor slowing cloud adoption, despite the fact that databases in the cloud have been available for years.

For this reason, there has been a recent surge of interest in data infrastructure designed to take maximum advantage of the benefits that cloud computing provides. A cloud-native database is one that achieves the goals of scalability, elasticity, resiliency, observability, and automation; the K8ssandra project is a great example. It packages Apache Cassandra and supporting tools into a production-ready Kubernetes deployment.

Installing Private S3 Storage With MinIO on Alibaba Cloud Kubernetes

In this article, we will walk through the step-by-step installation of the private S3-compatible storage server MinIO on Alibaba Cloud Container Service for Kubernetes. We will expose the MinIO web UI to the internet and make the MinIO API available to the MC CLI in the Cloud Shell.

MinIO is an open-source, high-performance, S3-compatible object storage server. It allows you to build AWS S3-compatible data infrastructure.
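
Once the deployment is up, you can also verify connectivity from code. Below is a minimal sketch using the MinIO Python SDK; the endpoint, access key, secret key, and bucket name are placeholders, so substitute whatever values your own deployment exposes.

```python
from minio import Minio  # pip install minio

# Placeholder endpoint and credentials: replace with the values your deployment exposes.
client = Minio(
    "minio.example.com:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=True,  # set to False if the endpoint is plain HTTP
)

# Create a bucket if it does not exist, then list all buckets to confirm API access.
if not client.bucket_exists("test-bucket"):
    client.make_bucket("test-bucket")

for bucket in client.list_buckets():
    print(bucket.name, bucket.creation_date)
```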

A New Approach to Solve I/O Challenges in the Machine Learning Pipeline

Background

The drive for training accuracy leads companies to develop complicated training algorithms and collect large amounts of training data, with which single-machine training takes an intolerably long time. Distributed training seems promising in meeting the training speed requirements, but it faces challenges around data accessibility, performance, and storage system stability when dealing with I/O in the machine learning pipeline.

Solutions

The above challenges can be addressed in different ways. Two solutions are traditionally used to resolve data access challenges in distributed training. Beyond those, Alluxio provides a different approach.

Speeding Up the AI Supercomputing Platform – Practice at Unisound

Unisound has built Atlas, an industry-leading GPU/CPU heterogeneous computing platform with a distributed file system. The platform provides AI applications with high-performance computing and data access capabilities at massive scale. Based on the open-source Kubernetes architecture, the Unisound team developed the core features and built an AI supercomputing platform with a floating-point processing capacity of more than 10 PFLOPS (10 quadrillion floating-point operations per second). The platform supports the major machine learning frameworks, so developers can efficiently research and develop core applications in speech, NLP, big data, multimodal AI, and more. The platform also serves external customers such as SMBs and research institutions with customized computing and storage capabilities.

Problems and Challenges

On the Atlas platform, computation is decoupled from storage. At present, the storage servers, the computing servers, and the links between them are interconnected over 100 Gb/s InfiniBand.

Alluxio Use Cases Overview: Unify Silos With Data Orchestration

This blog is the first in a series introducing Alluxio as the data platform to unify data silos across heterogeneous environments. The next blog will include insights from PrestoDB committer Beinan Wang to uncover the value for analytics use cases, specifically with PrestoDB as the compute engine.

The ability to quickly and easily access data and extract insights is increasingly important to any organization. With the explosion of data sources, the trend toward cloud migration, and the fragmentation of technology stacks and vendors, there is huge demand for data infrastructure that delivers agility, cost-effectiveness, and the desired performance.

Possible Problems in Dead Hard Drive Recovery

A dead hard drive can result in permanent loss of data, which may be worth considerably more than the storage hardware itself. Over the past 5–7 years, hard drives’ storage capacity has increased significantly while their cost has dropped to as low as $40 for a 1 TB drive. Today, it’s common even for individuals to have 4 TB hard drives for storing or backing up critical data for long-term use.

Imagine if your hard drive, storing terabytes of precious data, fails and turns out to be dead. You could lose a lifetime’s worth of digital files: photos of family get-togethers and outings, Junior’s baseball game video, crucial project data, and more. The situation is terrifying and hard to tackle: how do you deal with a dead hard drive and recover your data?

Storage Format in Nebula Graph v2.0.0

Nebula Graph 2.0 has changed a lot compared with earlier releases. In the storage architecture, the change to the encoding format has the most significant impact on users. In Nebula Graph, data is stored as key-value pairs in RocksDB. This article covers several questions, such as the differences between the old and new encoding formats and why the format had to be changed.

Encoding Format in Nebula Graph 1.0

Let’s start with a brief review of the encoding format in Nebula Graph 1.0. For those who are not familiar with it, I recommend reading this post: An Introduction to Nebula Graph’s Storage Engine. In Nebula Graph 1.0, vertex IDs can only be represented as integers, so all VertexIDs are stored as int64 values.
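
As a simplified illustration (and explicitly not Nebula Graph’s exact key layout), the sketch below packs a fixed-width int64 vertex ID into a binary key of the kind a RocksDB-backed encoding relies on; the field order and widths here are assumptions chosen for demonstration only.

```python
import struct

def make_vertex_key(part_id: int, vertex_id: int, tag_id: int) -> bytes:
    """Illustrative only: pack a partition ID, an int64 vertex ID, and a tag ID
    into a fixed-width binary key. This is not the exact byte layout Nebula Graph uses."""
    # ">iqi" = big-endian: 4-byte int, 8-byte int64, 4-byte int (16 bytes total).
    return struct.pack(">iqi", part_id, vertex_id, tag_id)

key = make_vertex_key(part_id=1, vertex_id=12345, tag_id=7)
print(len(key), key.hex())  # 16 <hex-encoded key>
```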

Configure an NFS Storage Class on an Existing KubeSphere Cluster and Create a PersistentVolumeClaim

In my last article, I talked about how to use KubeKey to create a Kubernetes and KubeSphere cluster together with NFS storage. In fact, KubeSphere provides great flexibility: you can use KubeKey to install NFS storage when you create a cluster, or deploy it separately on an existing cluster.

KubeSphere features a highly interactive dashboard on which virtually all operations can be performed. In this article, I am going to demonstrate how to configure an NFS storage class on your existing KubeSphere cluster and create a PVC using that storage class.
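
All of this can be done from the dashboard, but for completeness, here is a rough sketch of creating an equivalent PVC with the official Kubernetes Python client. The storage class name nfs-client, the claim name, and the namespace are assumptions; replace them with whatever you configure in KubeSphere.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # uses your local kubeconfig
core_v1 = client.CoreV1Api()

# A 10Gi claim against an assumed NFS storage class named "nfs-client".
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="demo-nfs-pvc"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="nfs-client",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)

# Once created, the NFS provisioner should dynamically provision a matching PersistentVolume.
core_v1.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```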

Working With Persistent Volumes in Kubernetes

Introduction

The main reason behind containerization is to allow microservices to run in a stateless way. A container receives provisioned cloud resources, performs its tasks, and is destroyed as soon as the process is over, leaving no traces of the container or tied-up cloud resources to worry about. This is what made containerization so popular in the first place.

Running microservices as stateless instances, however, is not always as easy as it seems. As more applications get refactored and more microservices rely on containers for efficiency, sticking with the stateless concept becomes harder and harder. Stateless containers don’t always have the ability to meet complex requirements.

A Storage Hack for Bringing Stateful Apps to Kubernetes: Data That Follows Applications

Kubernetes, the open-source container orchestration system created by Google, is one of the most widely adopted technologies of the last decade. Everyone seems to love this open-source platform, as the double-digit growth in its adoption rate demonstrates.

In fact, the Cloud Native Computing Foundation (CNCF) found that in 2019, 84% of respondents ran Kubernetes containers in production, double the figure from two years prior. This growth in adoption is unlikely to stop any time soon, since Kubernetes is an efficient way to manage containers at scale, which translates into lower costs and increased cloud flexibility.

Which AWS Storage Solution Is Right for Your Elasticsearch Cluster?

Amazon Web Services (AWS) is one of the most capable cloud service providers around right now. It offers a number of different kinds of storage, providing low-cost data storage with high durability and high availability.

This article will help you understand the different storage services and features available in the AWS Cloud and how to select the right storage type for your ELK stack.

Optimize AWS Solution Architecture for Performance Efficiency

Amazon Web Services (AWS) offers various resources and services to help you build SaaS and PaaS solutions; however, the challenge is to achieve and maintain performance efficiency, which plays an important part in delivering business value. This article highlights some of the best practices for designing and operating reliable, secure, efficient, and cost-effective cloud applications that offer performance efficiency. There are two primary areas to focus on:

  1. Select and configure cloud resources for higher performance
  2. Review and monitor performance

Cloud resource

Datasources: What, Why, How?

I hope this post clarifies how a datasource works within a Java EE server and why you would need an XA datasource when you have distributed transactions.

The goal of this post is to provide a basic understanding so that you feel confident about when to use, and when not to use, datasources and XA datasources.

Creating EFS Using CloudFormation and Mounting it With EC2 Linux Instance

There are multiple ways of storing information on an instance, such as EBS or EFS. EBS (Elastic Block Store) can be thought of as a high-capacity storage device attached directly to your computer, whereas EFS (Elastic File System) is more like a shared network drive that many machines can mount at once. Which one you choose depends on your application or use case, but for the scenario we are discussing today, we are going to use EFS with an EC2 Linux instance.

Create EFS Using CloudFormation

Let's create EFS using CloudFormation. You can use the following template to create the resource; just pass the appropriate values when prompted while creating the stack.
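
Since the article's full template is not reproduced here, the snippet below is only a minimal sketch of driving CloudFormation from code with boto3: it declares a bare AWS::EFS::FileSystem (a real stack would also add AWS::EFS::MountTarget resources with your subnet and security group IDs), and the stack name and tag values are placeholders.

```python
import boto3  # pip install boto3

# Minimal illustrative template: a bare EFS file system with a name tag.
TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DemoFileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      FileSystemTags:
        - Key: Name
          Value: demo-efs
Outputs:
  FileSystemId:
    Value: !Ref DemoFileSystem
"""

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="demo-efs-stack", TemplateBody=TEMPLATE)

# Block until the stack finishes creating, then report success.
cfn.get_waiter("stack_create_complete").wait(StackName="demo-efs-stack")
print("Stack created")
```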

Getting Started With OpenEBS and Cloud-Native Distributed SQL

OpenEBS is a CNCF project that provides cloud-native, open-source container attached storage (CAS). OpenEBS delivers persistent block storage and other capabilities such as integrated backups, management of local and cloud disks, and more. For enterprise cloud-native applications, OpenEBS provides storage functionality that is idiomatic in cloud-native development environments, with granular storage policies and isolation that let cloud developers and architects optimize storage for specific workloads.

Because YugabyteDB is a cloud-native, distributed SQL database that runs in Kubernetes environments, it can interoperate with OpenEBS and many other CNCF projects.
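
As a quick sanity check before deploying a database on top of OpenEBS, the sketch below (an illustration, not taken from either project's documentation) lists the cluster's storage classes with the Kubernetes Python client and flags the ones backed by an OpenEBS provisioner.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()
storage_v1 = client.StorageV1Api()

# Flag storage classes whose provisioner belongs to OpenEBS.
for sc in storage_v1.list_storage_class().items:
    if "openebs" in (sc.provisioner or "").lower():
        print(f"OpenEBS-backed storage class: {sc.metadata.name} ({sc.provisioner})")
```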

Big Data Tutorial: Running Alluxio On HashiCorp Nomad

Get Nomad working for you

I recently worked on a PoC evaluating Nomad for a client. Since certain constraints limited what was possible in the client environment, I put together something “quick” on my personal workstation to see what was required for Alluxio to play nicely with Nomad.

Getting up and running with Nomad is fairly quick and easy; download the compressed binary, extract it, and start the Nomad agent in dev mode. Done! Getting Alluxio to run on Nomad turned out to be a little more involved than I thought. One major issue I ran into quite early on in the exercise was that Nomad doesn’t yet support persistent storage natively (expected in the next release).