Easier Data Science Development With Prodmodel

Data science development is an experimental and iterative process. It involves a lot of trial and error and it's easy to lose track of what's been tested and what hasn't. The following examples show how Prodmodel — an open-source data engineering tool I developed — helps to solve some of those problems. It works with Python 3.5 or above.

The idea behind Prodmodel is to structure your modeling pipeline as Python function calls. The tool then versions, caches, and reuses the objects returned by these functions. This way you don't have to keep track of the various data files, model files, or pieces of code you're experimenting with.
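
To make the idea concrete, here is a minimal plain-Python sketch of the pattern Prodmodel builds on: each pipeline step is an ordinary function, and results are cached by the arguments that produced them. This is an illustration of the concept only, not Prodmodel's actual API; the function names and the caching mechanism are placeholders.

```python
from functools import lru_cache

# Each pipeline step is an ordinary function; downstream steps call upstream ones.
# lru_cache stands in for Prodmodel's versioned, persistent cache (illustration only).

@lru_cache(maxsize=None)
def load_data(path: str) -> tuple:
    # In a real pipeline this would read a CSV file or a database table.
    return tuple(range(10))

@lru_cache(maxsize=None)
def build_features(path: str, window: int) -> tuple:
    data = load_data(path)
    return tuple(sum(data[i:i + window]) for i in range(len(data) - window + 1))

@lru_cache(maxsize=None)
def train_model(path: str, window: int) -> float:
    features = build_features(path, window)
    # A stand-in "model": just an average of the features.
    return sum(features) / len(features)

# Re-running with the same arguments reuses cached results instead of recomputing.
print(train_model("train.csv", window=3))
print(train_model("train.csv", window=3))  # cache hit
```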

Developers and Databases

To learn about the current and future state of databases, we spoke with and received insights from 19 IT professionals. We asked, "What advanced database knowledge or skills do developers need?" Here’s what they shared with us:

Conceptual

  • More development is happening from top to bottom, and full-stack development is how developers need to build things. Vendors like AWS and Azure have tools to enable this. Database developers are becoming data engineers. Learn basic statistics and have breadth across Java, Go, Python, and SQL. Keep in mind that you are building a service to help a customer deliver a solution to a problem.
  • App developers need a rudimentary understanding of how the data is being stored and accessed. How is the database being used, and what are you trying to achieve with the application? Select the database based on the requirements of the application: what requirements does your application put on the database? Database vendors are not always forthcoming about their limitations.
  • Understand the product and do some upfront work before you get into the specific things you are trying to accomplish. Understand how the product works: some products have clusters and nodes that need to talk to each other, and that affects how you write the application. In a NoSQL cluster with three nodes, node one may be up while nodes two and three are not. A lot of developers think the data is the easy part, and it's not; it's the part you cannot rewrite.
  • Being able to determine the right tool for the job is the knowledge that more developers need. The database landscape is flooded with solutions (some claim to be a jack-of-all-trades, while others have a more niche focus). But is the solution you picked going to be the right tool for the job? We see people pick Apache Cassandra because it’s scalable and reliable, but in talking to them it’s clear that a more traditional relational database would make more sense. And vice versa, where companies are hitting limits on their relational database and have a perfect use case for Cassandra. Either way, it’s important for more developers to hone skills around the fine details between different databases and knowing (with confidence) which one is most applicable to a specific use case.
  • In a world of managed services, developers should care less about deep database knowledge. This allows them to focus on the apps that differentiate their businesses and unlock new revenue streams from the data they own. They shouldn’t have to be experts on database management, but rather in building data-driven apps.
  • Developers need to understand how to interact with databases. Understanding database architectures, database design, access paths and optimizations, and even how data is arranged on disk or in memory is useful, particularly if you're a backend developer. The following are important: basic data modeling techniques; normalization vs. denormalization; SQL and when to use foreign keys; execution plans; prepared statements; the different types of joins available, depending on the database; and data obfuscation and encryption. (Prepared statements and execution plans are illustrated in the short sketch after this list.) Nowadays, developers tend to reach for database APIs to retrieve data, so they assume they don't need database knowledge. This is not the case.
  • 1) Understand conceptually what's going on; things should be much easier for application architecture. Databases are providing easier integrations and streaming, even if you're building on-prem. Choose the database based on the business application logic. Expect more as database technology advances, and the cloud simplifies things from an operational perspective. 2) Be familiar with different languages and paradigms for different use cases: analytic, operational, and transactional applications. Node.js for a web app, Python for a data-driven application, Java for other applications. Use the right language and platform. 3) Ask how you can leverage AI/ML as part of your application to create smarter, more intelligent applications. 4) Adopt microservices-based architectures, such as event-driven applications. 5) Use containerization to speed up the process of building, testing, and releasing software.
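
To illustrate two of the items above (prepared statements and execution plans), here is a short, hedged Python sketch using the standard library's sqlite3 module. The table, index, and query are made up for the example.

```python
import sqlite3

# In-memory database with a toy table, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")
conn.executemany(
    "INSERT INTO users (email, country) VALUES (?, ?)",
    [("a@example.com", "US"), ("b@example.com", "DE"), ("c@example.com", "US")],
)

# Prepared (parameterized) statement: the driver binds the value,
# which avoids SQL injection and lets the engine reuse the query plan.
rows = conn.execute("SELECT email FROM users WHERE country = ?", ("US",)).fetchall()
print(rows)

# Inspect the execution plan to see whether the index is actually used.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT email FROM users WHERE country = ?", ("US",))
for step in plan:
    print(step)
```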

Data Science

  • Cloud is a big skill that developers should have. We're training all of our developers on the cloud and cloud technology, and more databases will be deployed on the cloud across many vendors. Know solution architecture on the cloud, and know machine learning: even if you're not designing algorithms or doing ML as part of your day-to-day job, developers will need to know how to use ML tools and toolkits in the future. We're seeing some coalescence of data science and development. Speaking the same language reduces wasteful cycles spent translating between disciplines: developers get a better grasp of data science concepts while data scientists understand data and database management.
  • Understanding the power of distribution and being able to adopt data structures that match such infrastructures. It will also be important to have some know-how when it comes to data science, as understanding data will be even more important (especially real-time data, which raises the next important piece of knowledge: real-time tooling). There are a lot of messaging queues and distributed processing frameworks that work well with databases and are required to handle the load. And last but not least, machine learning skills will become increasingly important.

Other

  • What kind of storage and persistence do you need to solve the problem you are trying to solve?
  • 1) Education: no matter your title, research technology and get excited about it. Spend the time to learn about the technology, but stay fluid about the tools that can be used to solve problems. 2) Think long and hard about the application release process and know there are ways to bring databases into the workflow. 3) Take the time to challenge the old tradition of doing things and reimagine what the database process looks like in a DevOps workflow. Embrace the philosophy of doing it differently.
  • Developers should apply best practices when it comes to database security. The database must be secured both at rest and during transport. Shortcuts are highly discouraged when it comes to database protection/security.
  • Governance: a lot of people forget the importance of best practices in governance, such as CI and unit tests, code review, case-tracking tools, all the best practices. Modeling: given that developers have driven the NoSQL revolution, understanding how data is modeled, and how you want to model it, should be part of how you choose a database. What kinds of relationships do you want to discover, and what nodes do you want to express? It's also important for companies to reassess how they evaluate database technologies. With 300+ databases out there, a lot of basic things that weren't on the radar ten years ago are important today. You can't take for granted that new database technologies have the security, performance, reliability, and trustworthiness you need and want.
  • Developers need to understand SQL to be able to run powerful and complex queries. SQL is still the lingua franca of the data world; the appearance of NoSQL and Hadoop did not diminish its utility. SQL's advantage is that it is an industry standard, with an entire ecosystem and many learning resources around it. But it is a complex language that requires time to learn.
  • What should vendors be doing? Abstracting away the complexity: providing developers with the right kind of RESTful APIs and SDKs, and abstracting away GPUs. Developers, in turn, need to understand the roles of the new types of people in the organization, what goes into creating an ML algorithm, and the lineage of how models came to be. There should be more collaboration between the data team and the development team. Break down barriers.
  • Not everything that matters is technical or quantitative. Place more importance on the soft skills: how to better interact with your customers and the business side of the house to determine and meet their needs in a shorter timeframe.


Software Ate the World and Now the Models Are Running It

Along with our data ecosystem partners, we are seeing unprecedented demand for solutions to complex, business-critical challenges in dealing with data.

Consider this: data engineers walk into work every day knowing they're fighting an uphill battle. The root of the problem, or at least one problem, is that modern data systems are becoming impossibly complex. The amount of data being processed in organizations today is staggering, with annual data growth often measured in high double-digit percentages. Just a year ago, Forbes reported that 90% of the world's data had been created in the previous two years.

The Types of Data Engineers

Overview

We all know that in the last few years the position of data engineer, together with that of data scientist, has been in high demand in the market.

However, we can still observe a certain discrepancy in the market around the technical profile of a data engineer. I'm speaking specifically about the Latin American region; elsewhere in the world things may be more advanced.

Augmented Analytics: The Future of Data and Analytics

With the rising need and importance for data, many next generation technologies and data processing tools are coming into the spotlight. Today, becoming data-driven is a key priority for many advanced organizations. In order to sustain a good position in the industry, organizations need to adopt an advanced data processing tool such as augmented analytics.

Augmented analytics uses Artificial Intelligence (AI) and machine learning to augment human efforts to evaluate data. It beats traditional analysis tools by automating data insights and providing clearer information. According to Forbes, 89% of industry leaders believe that Big Data will transform business operations in the same way the Internet did, and that enterprises that don't implement a business intelligence (BI) strategy to gather, evaluate, and apply that information in a meaningful way will be left in the dust. Here's where an advanced data analytics tool like augmented analytics comes into the picture. According to a report by Allied Analytics, driven by the growing adoption of next-generation technologies such as augmented analytics, the global augmented analytics market is expected to reach roughly $29 billion by 2025.

Practical Strategies to Handle Missing Values

One of the major challenges in most BI projects is figuring out a way to get clean data. 60 to 80 percent of the total time is spent on cleaning the data before you can make any meaningful sense of it. This is true for both BI and predictive analytics projects. To improve the effectiveness of the data cleaning process, the current trend is to migrate from manual data cleaning to more intelligent, machine learning-based processes.

Identify the Type of Missing Values We Are Dealing With

Before we dig into how to handle missing values, it's critical to figure out the nature of the missing values. There are three possible types (commonly called missing completely at random, missing at random, and missing not at random), depending on whether there is a relationship between the missing data and the other data in the dataset.
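
As a quick, hedged illustration of the first practical step (profiling which values are missing, then applying a simple imputation), here is a minimal pandas sketch. The column names and the mean-fill strategy are assumptions for the example, not the article's prescribed approach.

```python
import pandas as pd
import numpy as np

# Toy dataset with missing values (NaN), just for illustration.
df = pd.DataFrame({
    "age": [34, np.nan, 52, 45, np.nan],
    "income": [48000, 61000, np.nan, 52000, 39000],
    "segment": ["a", "b", "b", None, "a"],
})

# Step 1: profile missingness per column before deciding how to handle it.
print(df.isna().sum())

# Step 2: a deliberately simple strategy -- mean imputation for numeric columns,
# a sentinel category for the categorical one. Whether this is appropriate
# depends on why the values are missing (MCAR / MAR / MNAR).
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())
df["segment"] = df["segment"].fillna("unknown")

print(df)
```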

Job Hunting in the Age of AI: How to Upskill for the 5 Hottest New Jobs

You could worry about the jobs AI will obliterate or focus on the exciting new jobs it will create. The latter will take you places.

AI is transforming global job markets. From reshaping career paths to developing new markets, it is an exciting time for people who wish to learn new skills and persevere. A report from the World Economic Forum (WEF) states that AI will create 58 million new jobs by 2022. Those who wish to capitalize on this enormous opportunity need to focus on reskilling and upskilling and take a proactive approach to learning so they can land some of the most sought-after jobs in the modern AI era.

12 Quotes That Question AI in 2019

2018’s 5th-biggest barrier to digital transformation within global companies was “immature digital culture.” What is your level of understanding of one of its major components, artificial intelligence, with regard to its issues and opportunities?

Do you think 2018 was the year of artificial intelligence, or do you think 2019 will be its year? Perhaps we are now in the middle of the AI era.

Why Open Data Is Key to Solving Global Challenges

The launch of the Global Risks Report from the World Economic Forum outlined the importance of cooperation when trying to solve problems that are inherently global in nature. The report warned, however, that the willingness for such cooperation was dwindling as states adopted a dog-eat-dog mindset.

A new report from UK academia suggests the key to global problems is not just global cooperation per se, but specifically open data. The report, penned by the Open Research Data Task Force, a group of senior professors from higher education, highlights how open research data significantly increases the likelihood that science will be able to infer patterns and identify solutions to complex problems.

How I Built the Perfect Data Science Team

When I assembled my first data science team, the term was barely getting printed in the Harvard Business Review. I had no clue that I was building a team pioneering in Big Data and data science. Now is a good time to reflect on this story that started twelve years ago.

At first, I really wanted to title this article “How I built the perfect data science team (without knowing it).” However, I did not want to give the impression I did not know what I was doing (I think I did). Nevertheless, here is my story…

Big Data Suggests War Might Be Lurking Around The Corner

Big data has proven remarkably effective at predicting a great many things in recent years, but perhaps one of the more maudlin examples comes via a recently published paper from the University of Florence, which used big data to predict the likelihood of war.

The researchers wanted to test the hypothesis that war is simply something mankind does, that fighting one another is hard-wired into us, and that war should therefore be something we can expect to occur reasonably consistently.

Normal and Laplace Distributions in Differential Privacy

I heard the phrase “normal approximation to the Laplace distribution” recently and did a double take. The normal distribution does not approximate the Laplace!

Normal and Laplace Distributions

A normal distribution has the familiar bell-curve shape. A Laplace distribution, also known as a double exponential distribution, is pointed in the middle, like a pole holding up a circus tent.
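
To see the difference numerically, here is a small hedged sketch using scipy.stats. The scales are matched so both distributions have unit variance (a choice made here for a fair comparison): the Laplace density is higher at the center and in the far tails, but lower in between.

```python
import numpy as np
from scipy.stats import norm, laplace

# Match variances: a Laplace distribution with scale b has variance 2*b**2,
# so b = 1/sqrt(2) gives unit variance, the same as the standard normal.
b = 1 / np.sqrt(2)

for x in [0.0, 1.0, 2.0, 4.0]:
    n = norm.pdf(x, loc=0, scale=1)
    l = laplace.pdf(x, loc=0, scale=b)
    print(f"x={x:4.1f}  normal={n:.5f}  laplace={l:.5f}")

# The Laplace density has a sharp peak at 0 and heavier tails than the normal,
# which is why the normal is a poor approximation to it.
```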

Data Science for Decision Makers

Introduction

In this article, I'm interviewing a veteran data scientist, Dr. Stylianos (Stelios) Kampakis, about his career to date and how he helps decision makers across a range of businesses understand how data science can benefit them.

While data science is a field showing immense growth at present, it's somewhat nebulous in its description. I think there's a lot of uncertainty as to exactly what it is and how to apply it. Fortunately, Stelios is an expert data scientist with a mission to educate the public about the power of data science and AI. He is a member of the Royal Statistical Society, an honorary research fellow at the UCL Centre for Blockchain Technologies, and CEO of The Tesseract Academy. A natural polymath, with a Ph.D. in Machine Learning and degrees in Artificial Intelligence, Statistics, Psychology, and Economics, he loves using his broad skillset to solve difficult problems and help companies improve their efficiency.

Entropy Extractor Used in μRNG

Last time, I mentioned μRNG, a true random number generator (TRNG) that takes physical sources of randomness as input. These sources are independent but non-uniform. This post will present the entropy extractor μRNG uses to take non-uniform bits as input and produce uniform bits as output.

We will present Python code for playing with the entropy extractor. (μRNG is extremely efficient, but the Python code here is not; it's just for illustration.) The code will show how to use the pyfinite library to do arithmetic over a finite field.
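
The post's own code isn't reproduced here, but the following hedged sketch shows the general shape of such an extractor with pyfinite: bytes drawn from three independent, possibly biased sources are treated as elements of GF(2^8) and combined as a*b + c. The block size, bias level, and combining rule are assumptions made for this illustration.

```python
import random
from pyfinite import ffield

F = ffield.FField(8)  # arithmetic in GF(2^8)

def biased_byte(p_one=0.3):
    """Pack 8 biased bits (P[bit = 1] = p_one) into one integer in [0, 255]."""
    byte = 0
    for _ in range(8):
        byte = (byte << 1) | (1 if random.random() < p_one else 0)
    return byte

def extract(a, b, c):
    """Combine three non-uniform bytes into one closer-to-uniform byte: a*b + c in GF(2^8)."""
    return F.Add(F.Multiply(a, b), c)

# Three independent, biased sources feed the extractor.
output = [extract(biased_byte(), biased_byte(), biased_byte()) for _ in range(5)]
print(output)
```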

Integration of Apache NiFi and Cloudera Data Science Workbench for Deep Learning Workflows

Summary

Now that we have shown that it is easy to do standard NLP, next up is Deep Learning. As you can see, NLP, Machine Learning, Deep Learning, and more are all within your reach for building your own AI as a Service using tools from Cloudera. These can run in public or private clouds at scale. Now you can run and integrate machine learning services, computer vision APIs, and anything you have created in-house with your own Data Scientists. The pre-trained YOLO flow downloads the image from the URL to /tmp in order to process it. The Python 3 script will also download the GluonCV model for YOLO3.

Using a Pre-trained Model
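
A hedged sketch of what the pre-trained-model step might look like with GluonCV follows; the model name, image URL, and file paths are assumptions for illustration, not the article's actual script.

```python
from gluoncv import model_zoo, data, utils

# Download a test image to /tmp (URL and path are placeholders for the example).
url = "https://example.com/some-image.jpg"
im_fname = utils.download(url, path="/tmp/input.jpg")

# Load a pre-trained YOLO3 model from the GluonCV model zoo
# (this downloads the weights the first time it runs).
net = model_zoo.get_model("yolo3_darknet53_coco", pretrained=True)

# Preprocess the image and run object detection.
x, img = data.transforms.presets.yolo.load_test(im_fname, short=512)
class_ids, scores, bounding_boxes = net(x)

print(class_ids.shape, scores.shape, bounding_boxes.shape)
```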

Using Cloudera Data Science Workbench With Apache NiFi

Using Deployed Models as a Function as a Service

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP, but it will work for all CDSW regardless of install type.

In my simple example, I built a Python model that uses TextBlob to run sentiment against a passed sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
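
Here is a minimal hedged sketch of what such a model function could look like with TextBlob; the function name and argument shape are illustrative (CDSW models typically wrap a single Python function that takes a dict of arguments), not the author's exact code.

```python
from textblob import TextBlob

def predict(args):
    """Score the sentiment of a sentence passed in as {"sentence": "..."}."""
    sentence = args.get("sentence", "")
    sentiment = TextBlob(sentence).sentiment
    return {
        "polarity": sentiment.polarity,          # -1.0 (negative) to 1.0 (positive)
        "subjectivity": sentiment.subjectivity,  # 0.0 (objective) to 1.0 (subjective)
    }

# Local test; in CDSW this function would be deployed as a model endpoint
# and called from an Apache NiFi flow.
print(predict({"sentence": "NiFi and CDSW make this pipeline pleasant to build."}))
```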

What Is Data Mining?

Everyone wants an edge. And in the digital age of business, the greatest strategic advantage comes from slicing, dicing, and analyzing data from every possible angle.

Data mining is the automated process of sorting through huge data sets to identify trends and patterns and establish relationships. And as enterprise data proliferates — now over 2.5 quintillion bytes per day — it'll continue to play an increasingly important role in the way businesses plan their operations and address challenges in the future.