How Technical Operations Can Build on the Success of Data Science Notebooks

Data science notebooks, a popular document format used for publishing code, results, and explanations in readable and executable form, broke new ground by combining an ongoing narrative with interactive elements and displays. The result was a new way to capture and transfer knowledge about the process of discovering insights. By studying why data science notebooks have worked so well, we can understand more about related areas with similar characteristics, such as Technology Operations (TechOps). 

At first glance, many of the attributes of data science notebooks also apply to TechOps. However, data scientists and TechOps teams have different objectives. A data scientist is interested in how results vary as elements of a query change. A TechOps team responsible for complex operational systems looks for variances and patterns, seeks to understand the root cause, and takes corrective action. Data science notebooks are conducive to instruction and are easy to change. However, in a production operations setting, things need to be repeatable rather than variable. To align with the different user needs in TechOps, the notebook concept evolved into runbooks.

Kubeflow Fundamentals Part 4: External Add-ons

Welcome to the fourth blog post in our “Kubeflow Fundamentals” series, specifically designed for folks brand new to the Kubeflow project. The aim of the series is to walk you through a detailed introduction to Kubeflow, a deep dive into its various components and add-ons, and how they all come together to deliver a complete MLOps platform.

If you missed the previous installments in the “Kubeflow Fundamentals” series, you can find them here:

Bias vs. Fairness vs. Explainability in AI

Over the last few years, there has been a distinct focus on building machine learning systems that are, in some way, responsible and ethical. The terms “Bias,” “Fairness,” and “Explainability” come up all over the place but their definitions are usually pretty fuzzy and they are widely misunderstood to mean the same thing. This blog aims to clear that up.

Bias

Before we look at how bias appears in machine learning, let’s start with the dictionary definition for the word:

“Inclination or prejudice for or against one person or group, especially in a way considered to be unfair.”

Look! The definition of bias includes the word “unfair.” It’s easy to see why the terms bias and fairness get confused for each other a lot.

Kubeflow Fundamentals Part 3: Distributions and Installations

Welcome to the third blog post in our “Kubeflow Fundamentals” series, specifically designed for folks brand new to the Kubeflow project. The aim of the series is to walk you through a detailed introduction to Kubeflow, a deep dive into its various components, and how they all come together to deliver a complete MLOps platform.

If you missed the previous installments in the “Kubeflow Fundamentals” series, you can find them here:

Processing 3D Data Using Python Multiprocessing Library

Today we’ll cover tools that are very handy when working with large amounts of data. Rather than repeating general information you can find in manuals, I’ll share some little tricks I’ve discovered, such as using tqdm with multiprocessing imap, working with archives in parallel, plotting and processing 3D data, and searching for a similar object within object meshes when you have a point cloud.

So why should we resort to parallel computing? Nowadays, if you work with any kind of data, you might face problems related to "big data". Whenever the data doesn’t fit in RAM, we need to process it piece by piece. Fortunately, modern programming languages allow us to spawn multiple processes (or even threads) that work well on multi-core processors. (NB: that doesn’t mean single-core processors cannot handle multiprocessing; there’s a Stack Overflow thread on that topic.)
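As a rough illustration of the tqdm-with-imap trick mentioned above, here is a minimal sketch; it is not code from the article, and the worker function, pool size, and chunk size are made-up placeholders:

```python
from multiprocessing import Pool

from tqdm import tqdm


def heavy_transform(item: int) -> int:
    # Stand-in for CPU-bound work, e.g. processing one mesh or one archive member.
    return item * item


if __name__ == "__main__":
    items = list(range(10_000))
    with Pool(processes=4) as pool:
        # imap yields results lazily and in order, so wrapping it in tqdm
        # shows live progress while the worker processes keep the cores busy.
        results = list(tqdm(pool.imap(heavy_transform, items, chunksize=64),
                            total=len(items)))
    print(len(results))
```

The same pattern extends to the parallel archive processing mentioned above: swap heavy_transform for a function that reads and processes a single archive member.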

Top 8 Skills To Have to Find a Data Analyst Job

Every company strives to accumulate data, for example by tracking its competitors' performance, sales figures, and purchasing patterns, in an effort to be more competitive. However, no one can understand customer behavior or competitor performance without the ability to analyze all that data.

Data analyst skills refer to a person's ability to gather and organize information so that it is converted into meaningful insights. This article will show how the insights and patterns a skilled data analyst can help uncover can prove extremely useful in making both immediate and future business decisions. Let’s get started.

Boosted Embeddings with Catboost

Introduction

When working with a large amount of data, it becomes necessary to compress the feature space into vectors. An example is text embeddings, which are an integral part of almost any NLP model creation process. Unfortunately, it is not always possible to use neural networks to work with this type of data; the reason may be, for example, slow fitting or inference speeds.

I want to suggest an interesting way to use gradient boosting that few people know about.
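To make the idea concrete, here is a minimal sketch rather than the article's exact recipe: pre-computed text embeddings are fed to CatBoost as ordinary numeric features, with a TF-IDF plus SVD step and a tiny toy corpus standing in for whatever vectorizer and data you actually use.

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and labels, purely for illustration.
texts = ["great product", "terrible service", "loved it", "never again",
         "works great", "awful experience"]
labels = np.array([1, 0, 1, 0, 1, 0])

# Compress the sparse TF-IDF feature space into dense low-dimensional vectors.
tfidf = TfidfVectorizer().fit_transform(texts)
embeddings = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Gradient boosting on the embedding columns, no neural network required.
model = CatBoostClassifier(iterations=50, depth=3, verbose=False)
model.fit(embeddings, labels)
print(model.predict(embeddings))
```

The appeal is that training and inference stay fast even in cases where a neural model would be too slow to fit or to serve.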

Finding the Story in the EU Fishing Rights Data

As Brexit trade negotiations were dragging on at the start of the year, a lot of the discourse focused on perceived inequities in fishing rights. I felt there was a story in the data that could add depth and detail to the narrative. Despite having the largest Exclusive Economic Zone (EEZ) of all EU countries, and some of the richest fishing grounds, UK fleets are restricted to relatively modest catches.

The Common Fisheries Policy provides EU states with mutual access to each other's fishing grounds but sets quotas based largely on catch figures from 40 years ago, which today seem arbitrary. Earlier this year, the UK government was pushing to reverse this by proposing a "zonal attachment" model, where quotas would be carved up relative to the abundance of fish in each country's waters.

The Importance of Defining Fairness for Decision-Making AI Models

Defining fairness is a problematic task. The definition depends heavily on context and culture, and when it comes to algorithms, every problem is unique and so will be solved using its own unique dataset. Algorithmic fairness can stem from statistical and mathematical definitions, and even legal definitions, of the problem at hand. Furthermore, if we build models based on different definitions of fairness for the same purpose, they will produce entirely different outcomes.

The measure of fairness also changes with each use case. AI for credit scoring is entirely different from customer segmentation for marketing efforts, for example. In short, it’s tough to land on a catch-all definition, but for the purpose of this article, I thought I’d make the following attempt: An algorithm has fairness if it does not produce unfair outcomes for, or representations of, individuals or groups.
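As one concrete example of a statistical definition, here is a small sketch of demographic parity, which compares the rate of positive decisions across groups; the group labels and decisions below are made up purely for illustration.

```python
import numpy as np

# Hypothetical protected-group membership and model decisions (e.g. credit approvals).
group = np.array(["A", "A", "A", "B", "B", "B", "B", "A"])
approved = np.array([1, 0, 1, 0, 0, 1, 0, 1])

rate_a = approved[group == "A"].mean()
rate_b = approved[group == "B"].mean()

# Demographic parity difference: 0 means both groups receive positive
# decisions at the same rate. A different definition, such as equalized
# odds, could judge the very same model quite differently.
print(f"approval rate A = {rate_a:.2f}, B = {rate_b:.2f}, gap = {abs(rate_a - rate_b):.2f}")
```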

The Future of AI in Insurance

Artificial intelligence (AI) and machine learning (ML) have come a long way, both in terms of adoption across the broader technology landscape and in the insurance industry specifically. That being said, there is still much more territory to cover in helping integral employees like claims adjusters do their jobs better, faster, and more easily.

Data science is currently being used to uncover insights that claims representatives say wouldn't be available otherwise, and which can be extremely valuable. Data science steps in to identify patterns within massive amounts of data that are too large for humans to comprehend on their own; machines can alert users to relevant, actionable insights that improve claim outcomes and facilitate operational efficiency.

Why Fairer AI Is Essential For Long-Term Survival

Introduction

In my most recent post, I covered some areas that I hope to see evolve in the next year and beyond. How we can do more with data across industries is, of course, an important consideration for data scientists, businesses, and society as a whole, as better models lead to improved products and services. 

When machine learning models for cancer diagnoses show promise, we naturally rally around this positive step and rejoice in the vision of a brighter future because it’s a victory that touches us all in some way. But there are many other ways AI can and must be used for good in the world, and in my next few posts, I want to use a financial services example that affects all of us, to show how that can be achieved. 

How to Create a Mosaic Chart Using JavaScript

Data visualization is a valuable tool today, with data everywhere and plenty of opportunities to tap into it for insights. Visual charts are essential for communicating ideas, identifying patterns, and making the most of the available data.

So then, would you like to quickly and easily learn how to create a really cool chart that showcases data in an interesting way?

5 Customer Data Integration Best Practices

For the last few years, you have heard the terms "data integration" and "data management" dozens of times. Your business may already invest in these practices, but are you benefitting from this data gathering? 

Too often, companies hire specialists, collect data from many sources and analyze it for no clear purpose. And without a clear purpose, all your efforts are in vain. You can take in more customer information than all your competitors and still fail to make practical use of it.  

Graph-Based Data Science, Machine Learning, and AI

Introduction

Over the last few years, we have seen what was once a niche research topic — graph-based machine learning — snowball. The Year of the Graph was among the first to take stock, point towards this development, and recognize graph-based AI as a key pillar for future development in the field.

In this edition of the YotG Newsletter, we highlight resources focused on graph-based machine learning and data science. This is not to say that there's a lack of news on graph analytics, graph databases, and knowledge graphs — rather, the opposite is true.

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.

Top 21 Data Mining Tools

Data mining is a world in itself, which is why it can easily get very confusing. There is an incredible number of data mining tools available on the market. However, while some might be more suitable for handling data mining in Big Data, others stand out for their data visualization features.

As is explained in this article, data mining is about discovering patterns in data and predicting trends and behaviors. Simply put, it is the process of converting vast sets of data into relevant information. There is not much use in having massive amounts of data if we do not actually know what it means.