A Practical Way to Think About Prediction Accuracy

One of the most common questions management asks when a team tries to deploy a model is, "What is the accuracy?" Wanting the best possible accuracy before going live is a trap companies tend to fall into.

When talking about accuracy, it's important to compare the accuracy your model provides against what you do now without the model.

How Do You Measure If Your Customer Churn Predictive Model Is Good?

Accuracy is a key measure that management looks at before giving the green light to take a model to production. This article talks about the practical aspects of what to measure and how to measure it. Please refer to this article to learn about the common mistakes made in measuring accuracy.

Two Important Points to Consider When Measuring Accuracy

  1. The data used to measure accuracy should not have been used in training. You can split your data 80/20: use 80% to train, then predict on the remaining 20% and compare the predicted values with the actual outcomes to compute the accuracy.
  2. One outcome can eclipse the other. Say 95% of your transactions are not fraud. If the algorithm marks every transaction as not fraud, it's right 95% of the time. So the accuracy is 95%, but the 5% it gets wrong can break the bank. In those scenarios, we need other metrics, such as sensitivity and specificity, which we will cover in this article in a practical way (see the sketch after this list).
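
To make the second point concrete, here is a minimal sketch, using hypothetical example data, of how plain accuracy hides a useless model on imbalanced data while sensitivity and specificity expose it:

```python
# Hypothetical fraud data: 1 = fraud, 0 = not fraud.
y_true = [0] * 95 + [1] * 5   # 95% of transactions are legitimate
y_pred = [0] * 100            # a "model" that calls everything not-fraud

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # 0.95 -- looks impressive
sensitivity = tp / (tp + fn)         # 0.00 -- catches zero fraud
specificity = tn / (tn + fp)         # 1.00 -- trivially perfect

print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}")
```

The 95% accuracy says nothing about fraud detection; a sensitivity of zero shows the model never catches a single fraudulent transaction.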

Problem Definition

The goal of this predictive problem is to identify which customers will churn. The dataset has 1,000 rows. Use an 80% sample (800 rows) for training and the remaining 20% (200 rows) to measure accuracy. Say we have trained the model on the 800 rows and are predicting on the 200 rows.
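
As a minimal sketch of that split (the file name and column names here are hypothetical placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")        # 1,000 rows in this example
X = df.drop(columns=["churned"])     # predictor columns
y = df["churned"]                    # actual outcome: did the customer churn?

# 800 rows for training, 200 held out to measure accuracy;
# stratify keeps the churn ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```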

RPA + Machine Learning = Intelligent Automation

Robotic process automation has generated a lot of buzz across many different industries. As businesses focus on digital innovation, automation of repetitive tasks to increase efficiency while decreasing human errors is an attractive proposition.

Robots will not tire, will not get bored, and will perform tasks accurately, helping their human counterparts improve productivity and freeing them up to focus on higher-level tasks.

Build Your First Python Chatbot Project

Introduction

Chatbots are extremely helpful for business organizations and customers alike. The majority of people prefer to talk directly to a chatbot instead of calling service centers. Facebook released data that proved the value of bots: more than 2 billion messages are sent between people and companies monthly. HubSpot research tells us that 71% of people want to get customer support from messaging apps. It is a quick way to get their problems solved, so chatbots have a bright future in organizations.

Today, we are going to build an exciting chatbot project. We will implement a chatbot from scratch that can understand what the user is talking about and give an appropriate response.
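
The core idea, reduced to a minimal sketch, is intent matching: map the user's words to a known intent and reply with one of its canned responses. The intents and phrases below are hypothetical stand-ins for the fuller, trained model the project builds:

```python
import random

# Hypothetical intents; a real project trains a classifier on many patterns.
intents = {
    "greeting": {"patterns": {"hello", "hi", "hey"},
                 "responses": ["Hello!", "Hi there, how can I help?"]},
    "hours":    {"patterns": {"open", "hours", "when"},
                 "responses": ["We are open 9 a.m. to 5 p.m., Monday to Friday."]},
}

def respond(message: str) -> str:
    words = set(message.lower().split())
    # Pick the intent whose keyword patterns overlap most with the message.
    best = max(intents.values(), key=lambda i: len(words & i["patterns"]))
    if not words & best["patterns"]:
        return "Sorry, I didn't understand that."
    return random.choice(best["responses"])

print(respond("hi there"))            # a greeting response
print(respond("when are you open"))   # the opening-hours response
```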

Accelerated Extract-Load-Transform Data Pipelines

As a columnar database with both strong CPU and GPU performance, the OmniSci platform is well suited for Extract-Load-Transform (ELT) pipelines (as well as the data science workloads we more frequently demonstrate). In this blog post, I’ll demonstrate an example ELT workflow, along with some helpful tips when merging various files with drifting data schemas. If you’re not familiar with the two major data processing workflows, the next section briefly outlines the history and reasoning for ETL-vs-ELT; if you’re just interested in the mechanics of doing ELT in OmniSci, you can skip to the “Baywheels Bikeshare Data” section.
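
As a miniature illustration of the "drifting schema" problem (file and column names are hypothetical), pandas.concat unions the columns across files and fills the gaps with NaN, so a batch of inconsistent exports can still become one loadable frame:

```python
import glob
import pandas as pd

# Monthly exports whose columns drift over time (hypothetical files).
frames = [pd.read_csv(path) for path in sorted(glob.glob("trips_*.csv"))]

# concat aligns on column names; missing columns become NaN, not errors.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.columns.tolist())   # union of every file's columns
```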

A Brief History of ETL vs. ELT for Loading Data

From the first computerized databases in the 1960s, the Extract-Transform-Load (ETL) data processing methodology has been an integral part of running a data-driven business. Historically, storing and processing data was too expensive to accumulate data without knowing what you were going to do with it, so a process such as the following would occur each day:

Making Data Scientists Productive in Azure

Doing data science today is far more difficult than it will be in the next 5 to 10 years. Sharing and collaborating on workflows is painful, and pushing models into production is challenging. Let’s explore what Azure provides to ease data scientists’ pains.

In this post, you will learn about Azure Machine Learning Studio, Azure Machine Learning, Azure Databricks, the Data Science Virtual Machine, and Cognitive Services. Which tools and services should we choose based on problem definition, skillset, or infrastructure requirements?

Using R on Jupyter Notebook

Overview

R is an interpreted programming language for statistical computing and graphics supported by the R Foundation. It is widely used among statisticians and data miners for developing statistical software and data analysis.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows, and macOS.

A Beginner’s Guide to Machine Learning: What Aspiring Data Scientists Should Know

Before choosing a machine learning algorithm, it's important to know each algorithm's characteristics so you can generate the desired outputs and build smart systems.

Data science is growing extremely fast. As the demand for AI-enabled solutions increases, delivering smarter systems to industry has become essential, and those solutions must be correct and efficient enough to meet every requirement. Hence, applying the right machine learning algorithm to the given dataset to produce sound results and train the intelligent system is one of the most essential steps in the entire process.

The Complete Data Science LinkedIn Profile Guide

Why Data Scientists Should Be Using LinkedIn

To date, there are more than 830,000 data science LinkedIn profiles registered worldwide. Despite the number of data scientists currently available or in roles, it’s no secret there is still a major talent shortage. In fact, according to a report by O’Reilly Media, nearly half of all European companies are struggling to fill data science positions. Studies performed by Indeed’s Hiring Lab show an overall increase of 256% in data science job openings since 2013, with an increase of 31% year-over-year as recently as December 2018. Data science is a vast, complex industry with many subsets.

Variations in roles oftentimes require such specific skillsets that positions are left unfilled for an average of up to 45 days. So, what does this mean for you as someone in data science, engineering, or machine learning? You’re a hot commodity. There are start-ups, unicorns, and conglomerates that will want to work with you. We can guarantee it. Recruitment specialists want to be able to identify candidates who can offer organisations a unique set of skills. It’s imperative you showcase your skillset on LinkedIn, and you should start now!

A Complete Guide To Math And Statistics For Data Science

As Josh Wills once said,

“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”

Math and statistics are essential for data science because these disciplines form the basic foundation of all machine learning algorithms. In fact, mathematics is behind everything around us, from shapes, patterns, and colors to the count of petals in a flower. Mathematics is embedded in each and every aspect of our lives.

109 Data Science Interview Questions and Answers

Preparing for an interview is not easy. There is significant uncertainty regarding the data science interview questions you will be asked. No matter how much work experience or what data science certificate you have, an interviewer can throw you off with a set of questions that you didn’t expect.

During a data science interview, the interviewer will ask questions spanning a wide range of topics, requiring both strong technical knowledge and solid communication skills from the interviewee. Your statistics, programming, and data modeling skills will be put to the test through a variety of questions and question styles that are intentionally designed to keep you on your toes and force you to demonstrate how you operate under pressure.

Turning Your Raspberry Pi Into a Science Research Station Via BOINC

Use your computing power for the greater good...

BOINC lets you help cutting-edge science research using your computer (Windows, Mac, Linux) or Android device. BOINC downloads scientific computing jobs to your computer and runs them invisibly in the background. It’s easy and safe, and you can check it out here.

This is the entry paragraph on the Berkeley BOINC website: "The BOINC software, short for Berkeley Open Infrastructure for Network Computing, can also be installed on Raspberry Pis, making your Raspberry Pi your own little science research station. This way, you can help science projects such as SETI@home, Einstein@Home, Universe@Home, and many more."

Easily Extract Data From SQL Server for Fast and Visual Analytics With OmniSci

In preparation for releasing version 5.0, OmniSci continues to become more feature-rich as customers and community members help us understand how GPUs transform their analytics and data science workloads. However, in the same way you wouldn’t drive a racecar to the supermarket, OmniSci will never be the right tool for every data use case. Rather, we’re striving to be the leading analytics platform for the use cases we’re targeting and complementary to other best-of-breed tools in the enterprise.

To that end, OmniSci provides industry-standard connection interfaces (ODBC, JDBC), several open-source client packages (JavaScript, Python, Julia), and some OmniSci-specific utilities like SQLImporter and StreamImporter to help users bridge their legacy systems and OmniSci.
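
As an illustrative sketch of that bridging, here is one way to pull rows out of SQL Server with pyodbc and load them into OmniSci with the pymapd client; the server names, credentials, and table are hypothetical placeholders:

```python
import pandas as pd
import pyodbc
from pymapd import connect

# Extract from SQL Server over ODBC (connection string is a placeholder).
src = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver.example.com;DATABASE=sales;UID=user;PWD=secret"
)
df = pd.read_sql("SELECT * FROM transactions", src)

# Load into OmniSci via pymapd (default port 6274; credentials are placeholders).
con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="omnisci")
con.load_table("transactions", df, create=True)
```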

3 Essentials for Releasing Software at Speed Without Losing Quality

Delivering at speed without quality doesn't amount to much.

How to Reduce Time to Market While Maintaining Quality?

How long does it take at your company, from the time someone in sales or marketing comes up with an idea, to the time that it’s making money and adding value to your users? Let’s say it’s a simple change to your software, or an added feature that everyone agrees would be an improvement. And let’s say that the change would have to support 100,000 users in a 100-minute window. You want to avoid any risk and also design and ship it so that it provides a great user experience. How long would it take to make that a reality? If you think it’s five days, for example, how much time could you possibly shave off that? A few hours? A day? Two days?

With any digital transformation, it’s essential to attain “quality at speed” to be able to provide quick solutions with a reduced time to market, but without sacrificing quality. In this hyper-competitive world, the difference is not in who has the best idea, but in who can implement it and bring it to market in the shortest time, in the best way, with appropriate quality.

How to Install Anaconda on ECS

Anaconda is a free, open-source distribution of the Python and R programming languages used for data science and machine learning applications. Anaconda helps organizations develop, manage, and automate AI/ML, regardless of the scale of deployment. It allows organizations to scale from individual data scientists to a collaborative group of data scientists, and from a single server to thousands of nodes for model training and deployment.

In this tutorial, we will be installing and setting up Anaconda Python on an Alibaba Cloud Elastic Compute Service (ECS) instance with Ubuntu 16.04 installed.

Why Every Organization Needs a Data Analyst

Data-driven decisions make the world go round

There is so much hype around the data scientist role these days that when a company needs a specialist to get some insights from data, their first inclination is to look for a data scientist. But is that really the best option? Let’s see how the roles of data scientists and data analysts differ and why you may want to hire an analyst before any other role.

Data Scientist or Data Analyst

So, what’s the difference between data scientists and data analysts? The definitions of these roles can vary, but it’s usually believed that a data scientist combines three key disciplines: data analysis, statistics, and machine learning. Machine learning uses data analysis to learn and generate analytical models that can perform intelligent actions on unseen data with minimal human intervention. With such expectations, it’s clear that three-in-one is better than one-in-one, and data scientists have become more desired by companies.

Announcing OmniSci.jl: A Julia Client for OmniSci

Today, I’m pleased to announce a new way to work with the OmniSci platform: OmniSci.jl, a Julia client for OmniSci! This Apache Thrift-based client is the result of a passion project I started when I arrived at OmniSci in March 2018 to complement our other open-source libraries for accessing data: pymapd, mapd-connector, and JDBC.

Julia and OmniSci: Similar in Spirit and Outcomes

If you’re not familiar with the Julia programming language, Julia is a dynamically typed, just-in-time compiled language built on LLVM that can match or beat the performance of compiled languages such as C/C++ and Fortran. With the performance of C++ and the convenience of writing Python, Julia quickly became my favorite programming language when I started using it around 2013.