How to Utilize Python Machine Learning Models

Ever trained a new model and just wanted to use it through an API straight away? Sometimes you don't want to bother writing Flask code or containerizing your model and running it in Docker. If that sounds like you, you definitely want to check out MLServer. It's a Python-based inference server that recently went GA, and what's really neat about it is that it's a highly performant server designed for production environments. That means that, by serving models locally, you are running them in the exact same environment they will be in when they get to production.

This blog walks you through how to use MLServer by using a couple of image models as examples.
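
To give a flavour of what that looks like, here is a minimal sketch of a custom MLServer runtime, assuming the stable `MLModel` API; the class name and the dummy output are illustrative, not the image models from the post:

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class DummyImageModel(MLModel):
    """Illustrative runtime; a real one would wrap a trained image model."""

    async def load(self) -> bool:
        # Load model weights here (e.g. from self.settings.parameters.uri).
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Stand-in "prediction": report how many input tensors we received.
        n_inputs = len(payload.inputs)
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="n_inputs",
                    shape=[1],
                    datatype="INT64",
                    data=[n_inputs],
                )
            ],
        )
```

Pointed at by the `implementation` field of a `model-settings.json`, a runtime like this can be served locally with `mlserver start .`, which exposes the model over MLServer's REST and gRPC endpoints.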

What Is a Data Reliability Engineer, and Do You Really Need One?

As software systems became increasingly complex in the late 2000s, merging development and operations (DevOps) was a no-brainer. 

One-half software engineer, one-half operations admin, the DevOps professional was tasked with bridging the gap between building performant systems and making them secure, scalable, and accessible. It wasn't an easy job, but someone had to do it.

The Lakehouse: An Uplift of Data Warehouse Architecture

In short, the initial data warehouse architecture was designed to provide analytical insights by collecting data from various heterogeneous sources into a centralized repository, and it acted as a fulcrum for decision support and business intelligence (BI). But it has faced numerous challenges: long data-model design cycles (it supports only schema-on-write), an inability to store unstructured data, tight coupling of compute and storage in an on-premises appliance, and so on.

This article intends to highlight how the architectural pattern has been enhanced to transform the traditional data warehouse, first by rolling over to the second-generation data lake platform and eventually turning it into a lakehouse. Although the present data warehouse supports a three-tier architecture with an online analytical processing (OLAP) server as the middle tier, the lakehouse is still a consolidated platform for machine learning and data science, with metadata, caching, and indexing layers that are not yet available as a separate tier.

The Rise of the Data Reliability Engineer

With each day, enterprises increasingly rely on data to make decisions. This is true regardless of their industry: finance, media, retail, logistics, etc. Yet, the solutions that provide data to dashboards and ML models continue to grow in complexity. This is due to several reasons, including:

This need to run complex data pipelines with minimal error rates in such modern environments has led to the rise of a new role: the Data Reliability Engineer. Data Reliability Engineering (DRE) addresses data quality and availability problems. Combining practices from data engineering and systems operations, DRE is emerging as its own field within the broader data domain.

The Best MLOps Events and Conferences for 2022



Introduction

2021 was, quite rightly, touted as “The Year of MLOps”. The MLOps scene exploded, with thousands of companies adopting practices and tools aimed at helping them get models into production faster and more efficiently. A multitude of new vendors, consultancies, and open-source tools entered the field, making it more important than ever to stay on top of what's happening.

Throughout January I've been asking around to find out the best MLOps events people attended last year. There were loads of great suggestions to go through, but a handful kept coming up over and over again. I've combined those with my own experiences to create a list of the events and conferences you definitely don't want to miss:

Getting Started With Pandas – Lesson 4

Introduction

This is the fourth and final article of our Pandas training saga. In this article, we summarize the different Pandas functions used to treat missing data. Dealing with missing data is a key, day-to-day challenge of data science work, and it has a direct impact on algorithmic performance.

Missing Data

Before we start, let's take a look at the example dataset we will follow to explain the functions: a dataset we created ourselves, called `uncompleted_data`, which includes several use cases so that all the examples can be dealt with clearly.
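
Since the real `uncompleted_data` is custom-made for the article, here is a tiny stand-in DataFrame showing the kind of missing-data treatment the lesson covers (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the article's custom dataset.
uncompleted_data = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Madrid", "Barcelona", None, "Valencia"],
})

print(uncompleted_data.isnull().sum())  # missing values per column

# Fill numeric gaps with the column mean; drop rows that still have gaps.
filled = uncompleted_data.fillna({"age": uncompleted_data["age"].mean()})
complete_rows = uncompleted_data.dropna()
```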

Getting Started With NumPy

NumPy is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the N-dimensional array, `ndarray`. The library also contains many routines for statistical analysis.
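
As a quick, self-contained illustration of those ideas (the values are arbitrary):

```python
import numpy as np

# Create 1-D and 2-D ndarrays.
evens = np.arange(0, 10, 2)
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Getting info about an array.
print(matrix.shape, matrix.ndim, matrix.dtype)  # (2, 3) 2 float64

# Statistical routines, over the whole array or along an axis.
print(matrix.mean(), matrix.std())
print(matrix.mean(axis=0))  # per-column means
```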

Creating, Getting Info, Selecting, and Utility Functions

The 2009 'Wine Quality' dataset, elaborated by Cortez et al. and available from the UCI Machine Learning Repository, is a well-known dataset that contains wine quality information. It includes data about red and white wine physicochemical properties and a quality score.

Exploring CockroachDB with ipython-sql and Jupyter Notebook

Today, I will demonstrate how ipython-sql can be leveraged to query CockroachDB. This will require a secure instance of CockroachDB, for the reasons I will explain below.

Running a secure docker-compose instance of CRDB is beyond the scope of this tutorial. Instead, I will publish everything you need to get through the tutorial in my repo, including the Jupyter Notebook. You may also use CRDB docs to stand up a secure instance and change the URL in the notebook to follow along.

This post will dive deeper into the Python ecosystem and build on my previous Python post. Instead of relying on pandas alone, we're going to use a popular SQL extension called ipython-sql, a.k.a. SQLmagic, to execute SQL queries against CRDB.


As stated earlier, we need to use a secure instance of CockroachDB. In fact, from this point forward, I will attempt to write posts only with secure clusters, as that's the recommended approach. Ipython-sql uses SQLAlchemy underneath, and it expects database URLs in the format postgresql://username:password@hostname:port/dbname. CockroachDB does not support password fields with insecure clusters, as passwords alone will not protect your data.
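
In a notebook cell, that boils down to something like the following (the username, password, and database name are placeholders, not values from the repo):

```python
# Requires: pip install ipython-sql
%load_ext sql

# Secure CockroachDB clusters speak the PostgreSQL wire protocol,
# so a standard SQLAlchemy-style URL works (placeholder credentials).
%sql postgresql://myuser:mypassword@localhost:26257/defaultdb?sslmode=require

# Any %sql line now runs against CRDB.
%sql SELECT version();
```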

Kubeflow Fundamentals Part 6: Working With Jupyter Lab Notebooks

Welcome to the sixth blog post in our “Kubeflow Fundamentals” series, specifically designed for folks brand new to the Kubeflow project. The aim of the series is to walk you through a detailed introduction to Kubeflow, a deep dive into its various components and add-ons, and how they all come together to deliver a complete MLOps platform.

If you missed the previous installments in the “Kubeflow Fundamentals” series, you can find them here:

Top 7 Reasons To Opt For a Data Science Course in 2022

In today's world, data is something businesses and industries cannot do without. As technology progresses, data is becoming ever more useful to organizations for decision-making and for predicting future business trends. In today's data-driven business environment, it is necessary to have personnel who can understand data and figures.

Data science is a requisite if one wants to manipulate data technically to draw meaningful inferences. Knowledge and mastery of data science make a tech professional very valuable to a firm and, in turn, make the field a lucrative career option. Insights about customer behavior help companies focus their business on the target audience and grow in the right direction by leveraging the power of data. Data scientists can extract useful insights and meaningful forecasting models from raw data.

Getting Started With Pandas – Lesson 2

Introduction

This is the second post of our Pandas training saga. In this article, we summarize the different Pandas functions used to perform indexing, selection, and filtering.

Indexing, Selecting, and Filtering

Before we start, let's take a look at the didactic dataset we will follow throughout the examples. It is a well-known dataset that contains wine information.
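
For instance, loading the wine data and applying the three techniques looks roughly like this (the file name and separator assume the UCI red-wine CSV):

```python
import pandas as pd

# UCI wine-quality CSVs are semicolon-separated (file name assumed).
wine = pd.read_csv("winequality-red.csv", sep=";")

quality = wine["quality"]                       # selecting a column
subset = wine.loc[0:4, ["alcohol", "quality"]]  # label-based indexing
corner = wine.iloc[:5, :3]                      # position-based indexing
high_quality = wine[wine["quality"] >= 7]       # boolean filtering
```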

Getting Started With Pandas – Lesson 3

Introduction

This is the third post of our data science training saga with Pandas. In this article, we summarize the different Pandas functions used to perform iteration, maps, grouping, and sorting. These functions let us transform the data, giving us useful information and insights.

Iteration, Maps, Grouping, and Sorting

The 2009 'Wine Quality' dataset, elaborated by Cortez et al. and available from the UCI Machine Learning Repository, is a well-known dataset that contains wine quality information. It includes data about red and white wine physicochemical properties and a quality score.
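
As a taste of what the lesson covers, here is roughly how those operations look on the wine data (the file name and separator assume the UCI red-wine CSV):

```python
import pandas as pd

wine = pd.read_csv("winequality-red.csv", sep=";")

# Map: derive a label from the quality score.
wine["quality_label"] = wine["quality"].map(
    lambda q: "high" if q >= 7 else "standard"
)

# Grouping: mean alcohol content per label.
print(wine.groupby("quality_label")["alcohol"].mean())

# Sorting: best-rated wines first.
top_wines = wine.sort_values("quality", ascending=False).head()
```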

Machine Learning Interview With Gema Parreño, Lead Data Scientist at Apiumhub

Today we have interviewed our own Gema Parreño, Lead Data Scientist at the software development company Apiumhub, where she develops data-driven solutions. She is passionate about the intersection of machine learning and games: she has had her own startup, contributed to open source through the StarCraft machine learning project, and spent time at Google Brain working on Stadia.

Gema gives a talk about Mempathy as an AI Safety and Alignment opportunity, and we wanted to dig deeper and find out more about it, as well as how the idea arose to use it to implement Safety and Alignment techniques.

Data Governance and Data Management

Introduction

Enterprises that don't embrace data, or are late to the party, face serious consequences compared to early adopters. When talking about good data practices, most people associate the term with only a few of the multitude of practices that constitute a successfully run, data-driven enterprise.

Besides data analysis, data management is what readily comes to mind. Yet an equally universal, and perhaps even more critical, data practice is data governance.

The 10 Commandments for Performing a Data Science Project

In designing a data science project, establishing what we, or the users we are building models for, want to achieve is vital, but this understanding only provides a blueprint for success. To truly deliver against a well-established brief, data science teams must follow best practices in executing the project. To help establish what that might mean, I have come up with ten points to provide a framework that can be applied to any data science project.

1. Understand the Problem 

The most fundamental part of solving any problem is knowing exactly what problem you are solving. Make sure you understand what you are trying to predict, any constraints, and the ultimate purpose of the project. Ask questions early on, and validate your understanding with peers, domain experts, and end-users. If the answers align with your understanding, you know you are on the right path.