7 Best Python Libraries You Shouldn’t Miss in 2021

With more than 137,000 Python libraries available today, choosing the ones relevant to your project can be challenging.

Python libraries are critical if you’re looking to start a data science career. In this article, we will walk you through some of the best libraries that are worth learning this year.

Data Integration and ETL for Dummies (Like Me)

In early 2020, I was introduced to the idea of data integration through a friend who was working in the industry. Yes, I know. Extremely late. All I knew about it was that I could have my data in one (virtual) place and then have it magically appear in another (virtual) place. I had no clue how it was done or how important it was to modern businesses.

To give you some background, my past work experience is not in any kind of technical space. It is in business development and marketing for non-technical products. I probably should have been more aware of the technical world around me, but for now, you must forgive me for my ignorance.

Predictive Modeling in Data Science

Predictive modeling in data science is used to answer the question "What is going to happen in the future, based on known past behaviors?" Modeling is an essential part of data science, and it is mainly divided into predictive and preventive modeling. Predictive modeling, also known as predictive analytics, is the process of using data and statistical algorithms to predict outcomes with data models. Anything from sports outcomes and television ratings to technological advances and corporate economies can be predicted using these models.

Top 5 Predictive Models

  1. Classification Model: It is the simplest of all predictive analytics models. It puts data into categories based on historical data. Classification models are best for answering "yes or no" questions (a minimal sketch appears after this list).
  2. Clustering Model: This model groups data points into separate groups, based on similar behavior.
  3. Forecast Model: One of the most widely used predictive analytics models. It deals with metric value prediction, and this model can be applied wherever historical numerical data is available.
  4. Outliers Model: This model, as the name suggests, is oriented around exceptional data entries within a dataset. It can identify exceptional figures either by themselves or in conjunction with other numbers and categories.
  5. Time Series Model: This predictive model consists of a series of data points captured using time as the input parameter. It uses data from previous years to develop a numerical metric and predicts the next three to six weeks of data using that metric.
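For the classification case referenced above, here is a minimal sketch using scikit-learn; the churn dataset, column names, and split are invented purely for illustration.

```python
# Minimal "yes or no" classification sketch with scikit-learn.
# The churn data and column names are hypothetical examples.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Historical data: each row is a customer, 'churned' is the yes/no label.
df = pd.DataFrame({
    "monthly_spend": [20, 35, 80, 15, 60, 95, 25, 70],
    "support_tickets": [0, 1, 4, 0, 2, 5, 1, 3],
    "churned": [0, 0, 1, 0, 1, 1, 0, 1],
})

X = df[["monthly_spend", "support_tickets"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```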

Which Model Is Right For You?

To find out which predictive model is best for your analysis, you need to do your homework.

Top 8 Deep Learning Concepts Every Data Science Professional Must Know

Deep learning is making a good wave in delivering solutions to difficult problems that have faced the field of artificial intelligence (AI) for many years, as Yann LeCun, Yoshua Bengio, and Geoffrey Hinton have noted.

For data scientists to successfully apply deep learning, they must first understand how to apply the mathematics of modeling, choose the right algorithm to fit the model to the data, and come up with the right technique to implement it.

How to Become a Data Engineer: A Hype Profession or a Necessary Thing

“Big Data is the profession of the future” is all over the news. I will say even more: data engineering skills are an urgent need for developers. Today we create as much data every two days as humanity created in all the years before 2003. Gartner analysts named cloud services and cybersecurity among the top technology trends of 2021.

The trend is easily explained. Huge arrays of Big Data need to be stored securely and processed to obtain useful information. When companies moved to remote work, these needs became even more tangible. Industries such as e-commerce, healthcare, and EdTech all want to know everything about their online consumers. Data that merely sits on a server, however, has no value at all.

How to Write ETL Operations in Python

In this article, you’ll learn how to work with Excel/CSV files in a Python environment to clean and transform raw data into a more ingestible format. This is typically useful for data integration.

This example will touch on many common ETL operations such as filter, reduce, explode, and flatten.
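The article works through these steps on its own dataset; as a rough, hedged illustration of what such operations look like in pandas (the file name "orders.csv" and its columns are invented here), consider:

```python
# Hedged illustration of common ETL steps in pandas; the input file name and
# its columns ("status", "items", "customer") are made up for this sketch.
import pandas as pd

raw = pd.read_csv("orders.csv")                     # extract: read the raw CSV export

# filter: keep only completed orders
completed = raw[raw["status"] == "completed"]

# explode: turn a comma-separated "items" column into one row per item
completed = completed.assign(items=completed["items"].str.split(","))
exploded = completed.explode("items")

# reduce: aggregate item counts per customer
summary = (
    exploded.groupby("customer")["items"].count().reset_index(name="item_count")
)

summary.to_csv("orders_clean.csv", index=False)     # load: write the transformed output
```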

How to Learn Data Science in 2021 (Resources to Get a Job)

I want to focus on resources you all can use to improve your technical skills and even more specifically resources to get your first job in the field. Because that’s the hardest part — getting your first job. Once you have that, you’ll learn the skills you need so fast that you won’t need people like me giving you advice. 

It’s really hard to learn data science and actually be good at it because there’s a long laundry list of topics you need to know to be a good data scientist.

Top 8 Future-Proof Analytics Trends for 2021

After a most unpredictable year full of black swan events, now is the time for data scientists and business analysts to humbly take the hard-learned lessons of 2020 and move forward. 

This is not one of those “crystal ball” new year prediction blogs! Rather than chase buzzword mysticism of highfalutin prognostication, let us default to a tried-and-true model for charting a future-proof course forward: industry research trends. 

Grakn 2.0 Alpha: Best Practices in Distributed Systems and Computer Science

Grakn is a distributed knowledge graph: a logical database to organize large and complex networks of data as one body of knowledge. Visit grakn.ai to learn more.

When we kicked off 2020, we were so excited to launch Grakn Cosmos, Grakn’s first global conference, held here in London in the first week of February. It was an intense start to the year, but we learned so much about how our community was using Grakn to solve some of the most complex problems in their industries. We met with our community members from Life Sciences, Defence and Security, Financial Services, Robotics, and many other sectors. From financial analytics to drug discovery, cyber threat detection to robotics disaster recovery, Grakn was being used to tackle a higher order of complexity in data and knowledge, and it inspired all of us at Grakn Labs.

What Is Chaos Engineering?

In the past, software systems ran on-premise in highly controlled environments, managed by an army of sysadmins. Today, migration to the cloud is relentless; the stage has completely shifted. Systems are no longer monolithic and localized; they depend on many globally distributed, decoupled systems working in unison, often in the form of ephemeral microservices.

It is no surprise that Site Reliability Engineers have risen to prominence in the last decade. Modern IT infrastructure requires robust systems thinking and reliability engineering to keep the show on the road. Downtime is not an option. A 2020 ITIC Cost of Downtime survey indicated that 98% of organizations said a single hour of downtime costs more than $150,000, 88% reported that 60 minutes of downtime costs their business more than $300,000, and 40% of enterprises reported that one hour of downtime costs their organizations $1 million to more than $5 million.

Grouping and Sorting Records in Kumologica

Grouping and sorting of records are common functionalities seen in many microservices and integration requirements. In most use cases, the records fetched from the source system might be in raw form, with all the records treated separately, without having any records grouped as per the functional requirements. Similarly, another functionality that is popularly used is the sorting of records. This can be either sorting based on an existing property in the record, or it can be based on the computation of some properties and then sorting the records based on the result of the computation.

Grouping

Grouping is the functionality of combining records with a common property or attribute as a single unit. This can be based on one property or multiple properties associated with the records. These properties can also be referred to as keys. In the diagram below, we can see cars of different brands grouped based on their color and type properties. Group 1 has both Ford and BMW since they have the same traits of color and type.
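Kumologica provides dedicated flow nodes for this, but the underlying idea can be sketched in plain Python; the car records and the discount computation below are invented for illustration.

```python
# Group records by multiple keys (color, type) and sort them: a plain-Python
# sketch of the grouping/sorting idea. The car records are invented.
from itertools import groupby

cars = [
    {"brand": "Ford", "color": "red", "type": "sedan", "price": 25000},
    {"brand": "BMW", "color": "red", "type": "sedan", "price": 41000},
    {"brand": "Ford", "color": "blue", "type": "suv", "price": 33000},
]

def group_key(car):
    # The grouping keys: one group per (color, type) combination.
    return (car["color"], car["type"])

for key, members in groupby(sorted(cars, key=group_key), key=group_key):
    print(key, [car["brand"] for car in members])

# Sorting based on a computed property, e.g. price after a 10% discount:
by_discounted_price = sorted(cars, key=lambda car: car["price"] * 0.9)
```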

How Python Can Be Your Secret Weapon As a Data Scientist

Python is highly versatile and one of the most advanced programming languages in the world. There are tons of reasons why Python has become extremely popular. Many experts consider it one of the first choices among programming languages across industries.

Many also predict that the development of future technologies will rely heavily on Python. Technologies such as data science, AI, and ML are taking the driver's seat in combination with Python, making deep research and product development easier and easier.

5 Papers on Product Classification Every Data Scientist Should Read

Product categorization/product classification is the organization of products into their respective departments or categories. A large part of the process is the design of the product taxonomy as a whole. 

Product categorization was initially a text classification task that analyzed the product’s title to choose the appropriate category. However, numerous methods have been developed which take into account the product title, description, images, and other available metadata. 

The following papers on product categorization represent essential reading in the field and offer novel approaches to product classification tasks.

1. Don’t Classify, Translate

In this paper, researchers from the National University of Singapore and the Rakuten Institute of Technology propose and explain a novel machine translation approach to product categorization. The experiment uses the Rakuten Data Challenge and Rakuten Ichiba datasets. 

Their method translates or converts a product’s description into a sequence of tokens which represent a root-to-leaf path to the correct category. Using this method, they are also able to propose meaningful new paths in the taxonomy.

The researchers state that their method outperforms many of the existing classification algorithms commonly used in machine learning today.
  • Published/Last Updated – Dec. 14, 2018
  • Authors and Contributors – Maggie Yundi Li (National University of Singapore), Stanley Kok (National University of Singapore), and Liling Tan (Rakuten Institute of Technology)

2. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models

The authors of this paper propose attention convolutional neural network (ACNN) models over baseline convolutional neural network (CNN) models and gradient boosted tree (GBT) classifiers. 

The study uses Japanese product titles taken from Rakuten Ichiba as training data. Using this data, the authors compare the performance of the three methods (ACNN, CNN, and GBT) for large-scale product categorization. 

While differences in accuracy between the methods can be less than 5%, even minor improvements in accuracy can result in millions of additional correct categorizations. 

Lastly, the authors explain how an ensemble of ACNN and GBT models can further minimize false categorizations.

  • Published/Last Updated – April, 2017 for EACL 2017
  • Authors and Contributors – From the Rakuten Institute of Technology: Yandi Xia, Aaron Levine, Pradipto Das, Giuseppe Di Fabbrizio, Keiji Shinzato, and Ankur Datta

3. Atlas: A Dataset and Benchmark for Ecommerce Clothing Product Classification

Researchers at the University of Colorado and Ericsson Research (Chennai, India) have created a large product dataset known as Atlas. In this paper, the team presents its dataset which includes over 186,000 images of clothing products along with their product titles. 

An Introduction to 5 Types of Image Annotation

Looking for information on the different image annotation types? In the world of AI and machine learning, data is king. Without data, there can be no data science. For AI developers and researchers to achieve the ambitious goals of their projects, they need access to enormous amounts of high-quality data. With regard to image data, one major field of machine learning that requires large amounts of annotated images is computer vision.

‘mapPartitions’ in Apache Spark: 5 Key Benefits

'mapPartitions' is the only narrow transformation provided by the Apache Spark framework that achieves partition-wise processing, meaning it processes data partitions as a whole. All the other narrow transformations, such as map and flatMap, process partitions record-wise. 'mapPartitions', if used judiciously, can speed up the performance and efficiency of the underlying Spark job manifold.

'mapPartitions' provides an iterator over the partition data to the computing function and expects an iterator over a new data collection as the return value. The 'mapPartitions' API on a Dataset of type <T> expects a functional interface of type 'MapPartitionsFunction' to process each data partition as a whole, along with an Encoder of type <U>, where <U> represents the data type of the returned Dataset.
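The article targets the typed Dataset API in Java/Scala; as a rough Python analogue, PySpark exposes the same partition-wise idea through mapPartitions on an RDD. The per-partition setup in this sketch is a hypothetical stand-in for work (such as opening a database connection) that you would rather do once per partition than once per record.

```python
# PySpark sketch of partition-wise processing with mapPartitions.
# The "expensive setup" is a hypothetical stand-in for per-partition
# initialization, e.g. opening a DB connection once per partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapPartitionsSketch").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10), numSlices=2)

def process_partition(records):
    setup = "expensive per-partition setup"   # runs once per partition
    for record in records:                    # 'records' is an iterator
        yield record * 2                      # return an iterator as well

print(rdd.mapPartitions(process_partition).collect())
spark.stop()
```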

Identify and Resolve Stragglers in Your Spark Application

Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark job, identify the root cause behind them, and put the required fixes or preventive measures in place.

What Is a Straggler in a Spark Application?

A straggler is a very slow-running task belonging to a particular stage of a Spark application (every stage in Spark is composed of one or more tasks, each computing a single partition out of the total partitions designated for the stage). A straggler task takes an exceptionally long time to complete compared to the median or average time taken by the other tasks in the same stage. There can be multiple stragglers in a Spark job, present either in the same stage or across multiple stages.
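As a back-of-the-envelope illustration of that definition, a task can be flagged as a potential straggler by comparing its duration to the median for its stage; the durations and the 3x threshold below are made up.

```python
# Flag potential stragglers: tasks whose duration far exceeds the stage median.
# The task durations (in seconds) and the 3x threshold are made-up examples.
from statistics import median

task_durations = [12, 14, 11, 13, 95, 12]   # one value per task in a stage
stage_median = median(task_durations)

stragglers = [
    (index, duration)
    for index, duration in enumerate(task_durations)
    if duration > 3 * stage_median
]
print(f"median={stage_median}s, stragglers={stragglers}")   # -> task 4 at 95s
```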

What Is ETLT? Merging the Best of ETL and ELT Into a Single ETLT Data Integration Strategy

Data integration solutions typically advocate that one approach – either ETL or ELT – is better than the other. In reality, both ETL (extract, transform, load) and ELT (extract, load, transform) serve indispensable roles in the data integration space:

  • ETL is valuable when it comes to data quality, data security, and data compliance. It can also save money on data warehousing costs. However, ETL is slow when ingesting unstructured data, and it can lack flexibility. 
  • ELT is fast when ingesting large amounts of raw, unstructured data. It also brings flexibility to your data integration and data analytics strategies. However, ELT sacrifices data quality, security, and compliance in many cases.

Because ETL and ELT present different strengths and weaknesses, many organizations are using a hybrid “ETLT” approach to get the best of both worlds. In this guide, we’ll help you understand the “why, what, and how” of ETLT, so you can determine if it’s right for your use case.
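To make the ordering concrete, here is a toy ETLT sketch in Python, with sqlite3 standing in for the warehouse; the rows, table, and column names are invented. A light transform (masking an email field) happens before load, and a heavier aggregation runs afterwards inside the database.

```python
# Toy ETLT flow: Extract -> light Transform (mask PII) -> Load -> Transform in-warehouse.
# sqlite3 stands in for the warehouse; rows, table, and column names are invented.
import sqlite3

# Extract: pretend these rows came from a source system.
rows = [("alice@example.com", "US", 120.0), ("bob@example.com", "DE", 80.0)]

# Light transform before load: mask emails for compliance.
masked = [
    (email.split("@")[0][0] + "***", country, amount)
    for email, country, amount in rows
]

# Load into the "warehouse".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (email TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", masked)

# Heavier transform after load, pushed down to the warehouse engine.
print(conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall())
conn.close()
```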