The Impact of Big Data on Promoting Digital Privacy

There is no doubt that modern life is built on our connection to the digital space. The 21st century has seen a major shift online: companies are relocating to the cloud, banking and marketing are available over the internet, and our social and personal lives are all out there in the digital world.

This implies that no matter how open you are with the people in your circle, there is some data or information that you keep to yourself alone. The same applies digitally to the personal or business information processed on your computer or other devices online.

Data Lakes, Warehouses, and Lakehouses: Which Is Best?

Twenty years ago, your data warehouse probably wouldn’t have been voted hottest technology on the block. These bastions of the office basement were long associated with siloed data workflows, on-premises computing clusters, and a limited set of business-related tasks (e.g., processing payroll and storing internal documents).

Now, with the rise of data-driven analytics, cross-functional data teams, and, most importantly, the cloud, the phrase “cloud data warehouse” is nearly synonymous with agility and innovation.

Businesses Discover the Shocking Cost of Bad Data

Big data has become incredibly important for many companies all over the world. Unfortunately, the growing emphasis on big data has led to some poor decision-making. Many entities are prioritizing data scalability at the expense of data quality. As a result, bad data is causing them serious problems.

In the USA alone, bad data - any poorly structured or managed data - costs over $3 trillion every year. Whether it stems from a data engineer accidentally adding an extra zero, a discrepancy in how values are formatted, or problems with the data system itself, a lot can go wrong with data.
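As a hedged illustration (not from the article), the toy pandas sketch below flags both failure modes mentioned above - a suspiciously large amount and an inconsistently formatted date - in a hypothetical invoice table:

```python
import pandas as pd

# Toy invoice data with two classic quality problems: an accidental extra
# zero and inconsistent date formatting.
df = pd.DataFrame({
    "invoice_id": [1, 2, 3],
    "amount": [120.0, 1200.0, 12000.0],          # 12000.0 may be a fat-fingered 1200.0
    "invoice_date": ["2023-01-05", "05/01/2023", "2023-01-07"],
})

# Flag amounts that are wildly out of line with the rest of the column.
median = df["amount"].median()
df["amount_suspect"] = df["amount"] > 5 * median

# Flag dates that don't parse under the expected ISO format.
parsed = pd.to_datetime(df["invoice_date"], format="%Y-%m-%d", errors="coerce")
df["date_suspect"] = parsed.isna()

print(df[df["amount_suspect"] | df["date_suspect"]])
```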

Querying Kafka Topics Using Presto

Presto is a distributed query engine that allows querying different data sources such as Kafka, MySQL, MongoDB, Oracle, Cassandra, and Hive using SQL. It can analyze big data and query multiple data sources together.

In this article, we will discuss how Presto can be used to query Kafka topics. Below is the step-by-step process to set up Presto and Kafka and connect them. Here, I have used macOS, but a similar setup can be done on any other system.
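The full setup isn't reproduced here, but as a minimal sketch, the query step might look like the following from Python, assuming the presto-python-client package, a coordinator on localhost:8080, and a Kafka catalog named kafka exposing a hypothetical orders topic:

```python
import prestodb

# Connect to a Presto coordinator whose Kafka connector is registered as a
# catalog named "kafka" (etc/catalog/kafka.properties), default schema.
conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="presto",
    catalog="kafka",
    schema="default",
)
cur = conn.cursor()

# The Kafka connector exposes each message through internal columns such as
# _message; here we simply pull a few raw messages from the topic.
cur.execute('SELECT _message FROM "orders" LIMIT 5')
for (message,) in cur.fetchall():
    print(message)
```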

Accelerating Similarity Search on Really Big Data with Vector Indexing (Part II)

Many popular artificial intelligence (AI) applications are powered by vector databases, from computer vision to new drug discovery. Indexing, a process of organizing data that drastically accelerates big data search, enables us to efficiently query million, billion, or even trillion-scale vector datasets.

This article is supplementary to the previous blog, "Accelerating Similarity Search on Really Big Data with Vector Indexing," which covers the role indexing plays in making vector similarity search efficient, introduces the FLAT, IVF_FLAT, IVF_SQ8, and IVF_SQ8H indexes, and provides performance test results for all four. We recommend reading that blog first.

This article gives an overview of the four main types of indexes and then introduces four more: IVF_PQ, HNSW, ANNOY, and E2LSH.
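As a quick, hedged taste of how one of these indexes is used in practice, the sketch below builds an HNSW index with pymilvus 2.x; the collection name "articles", the 128-dimensional "embedding" field, and the parameter values are illustrative assumptions, not taken from the article:

```python
from pymilvus import connections, Collection

# Assumes a running Milvus 2.x instance and an existing collection named
# "articles" with a 128-dimensional float vector field called "embedding".
connections.connect(host="localhost", port="19530")
collection = Collection("articles")

# Build an HNSW graph index on the vector field; M and efConstruction trade
# index size and build time against recall.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()

# Search with a 128-dimensional query vector; ef controls search-time accuracy.
query_vector = [0.0] * 128
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=10,
)
print(results[0].ids)
```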

Making Machine Learning More Accessible for Application Developers

Introduction

Attempts at hand-crafting algorithms for understanding human-generated content have generally been unsuccessful. For example, it is difficult for a computer to “grasp” the semantic content of an image - a car, a cat, a coat - purely by analyzing its low-level pixels. Color histograms and feature detectors worked to a certain extent, but they were rarely accurate for most applications.

In the past decade, the combination of big data and deep learning has fundamentally changed the way we approach computer vision, natural language processing, and other machine learning (ML) applications; tasks ranging from spam email detection to realistic text-to-video synthesis have seen incredible strides, with accuracy metrics on specific tasks reaching superhuman levels. A significant positive side effect of these improvements is an increase in the use of embedding vectors, i.e., model artifacts generated by taking an intermediate result within a deep neural network. OpenAI’s docs page gives an excellent overview.
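As a hedged, minimal example of what "taking an intermediate result within a deep neural network" can look like, the sketch below strips the classification head off a pretrained ResNet-18 (assuming a recent torchvision and a hypothetical cat.jpg input) and uses the 512-dimensional penultimate activations as an embedding vector:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained ResNet-18 and drop its final classification layer so the
# forward pass returns the penultimate activations instead of class scores.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
embedder = torch.nn.Sequential(*list(resnet.children())[:-1])
embedder.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")   # hypothetical input file
with torch.no_grad():
    embedding = embedder(preprocess(image).unsqueeze(0)).flatten()

print(embedding.shape)  # torch.Size([512]) for ResNet-18
```

Vectors produced this way can then be compared, clustered, or stored in a vector database for similarity search.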

Data Analysis Using Google Cloud Data Studio

Introduction

Google Cloud Data Studio is a tool for transforming data into useful reports and data dashboards. As of now, Google Data Studio has 22 built-in Google connectors and 571 different partner connectors, which help connect data from BigQuery, Google Ads, Google Sheets, Cloud Spanner, Facebook Ads, Adobe Analytics, and many more.

Once the data is imported, reports and dashboards can be created with simple drag and drop and various filter options. Google Cloud Data Studio sits outside the Google Cloud Platform, which is why it is completely free.

The Ultimate Guide to Data Collection in Data Science

In today’s world, data plays a key role in the success of any business. Data produced by your target audience and your competitors, information from the field you work in, and data your company gathers on its own can help you find more customers, evaluate your business decisions, re-optimize your business model, or expand into other markets. Data will help you define the problems your business can solve and provide better service by pinpointing exactly what your clients need.

According to McKinsey Global Institute research, data-driven companies are 23 times more likely to acquire customers, six times as likely to retain customers, and 19 times as likely to be profitable.

What Is a Data Reliability Engineer, and Do You Really Need One?

As software systems became increasingly complex in the late 2000s, merging development and operations (DevOps) was a no-brainer. 

One-half software engineer, one-half operations admin, the DevOps professional is tasked with bridging the gap between building performant systems and making them secure, scalable, and accessible. It wasn’t an easy job, but someone had to do it.

How to Best Fit Filtering into Vector Similarity Search

Attribute filtering, or simply "filtering," is a basic function desired by users of vector databases. However, behind this seemingly simple function lies great complexity.

Suppose Steve sees a photograph of a fashion blogger on a social media platform and wants to find a similar jean jacket on an online shopping platform that supports image similarity search. After he uploads the image, the platform shows him a plethora of similar jean jackets. However, he only wears Levi’s, so the image similarity search results need to be filtered by brand. The problem is when to apply the filter: before or after the approximate nearest neighbor search (ANNS)?
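To make the pre- versus post-filtering trade-off concrete, here is an illustrative toy sketch (not the article's implementation) that uses brute-force nearest neighbors as a stand-in for ANNS:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 64), dtype=np.float32)      # catalog image embeddings
brands = rng.choice(["levis", "wrangler", "lee"], size=10_000)
query = rng.random(64, dtype=np.float32)                  # embedding of Steve's photo

def top_k(candidates, ids, k=5):
    # Brute-force nearest neighbors by Euclidean distance (stand-in for ANNS).
    dists = np.linalg.norm(candidates - query, axis=1)
    return ids[np.argsort(dists)[:k]]

# Pre-filtering: restrict the search space to the brand first, then search.
mask = brands == "levis"
pre = top_k(vectors[mask], np.flatnonzero(mask))

# Post-filtering: search everything, then drop other brands from the hits
# (may return fewer than k results, so real systems over-fetch).
hits = top_k(vectors, np.arange(len(vectors)), k=50)
post = [i for i in hits if brands[i] == "levis"][:5]

print(pre, post)
```

Pre-filtering searches a smaller candidate set but can interact badly with the index structure, while post-filtering keeps the index fast but risks returning too few results; that tension is exactly what the article explores.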

Capacity and Compliance in Hybrid Cloud, Multi-Tenant Big Data Platforms

As organizations realize how data-driven insights can empower their strategic decisions and increase their ROI, the focus is on building data lakes and data warehouses where all the big data can be safely archived. That data can then power data engineering, data science, business analytics, and operational analytics initiatives that benefit the business by improving operational efficiency, reducing operating costs, and enabling better strategic decisions. However, the exponential growth in the data we consume and generate every day makes a well-structured approach to capacity governance in the big data platform a necessity.

Introduction

Capacity governance and scalability engineering are interrelated disciplines: both require a comprehensive understanding of compute and storage capacity demands, infrastructure supply, and the dynamics between them in order to develop an appropriate scalability strategy for the big data platform. In addition, technical risk resolution and security compliance are equally important aspects of capacity governance.

Understanding the Database Connection Pool (DBCP) Properties

Recently, I faced an issue related to a very high load on the database layer: the database had too many connections open in parallel. I had to review my application’s database connection pool (DBCP) properties very closely. Since I was dealing with legacy code, I needed to understand the value assigned to each property and analyze whether it was still relevant for the present-day load. As I started looking at the properties, their values, and the consequent implications, I found a decent explanation in the Tomcat documentation. However, I wasn’t able to immediately map each property to the scenario where it would be used.

Since we were using Apache Tomcat’s JDBC connection pool, I started reading the source code to get a better understanding, and going through the ConnectionPool class gave me a lot of clarity. As I didn’t find any easy resource explaining this, I am summarizing my understanding in the form of simple flowcharts. I hope this will help others in a similar situation.

SingleStore DB Loves R

Abstract

The R programming language is very popular with many Data Scientists. At the time of writing this article, R ranks as the 11th most popular programming language according to TIOBE.

R provides many compelling data manipulation capabilities enabling the slicing and dicing of data with ease. Often data are read into and written out of R programs using files. However, R can also work with database systems. In this article, we'll see two quick examples of how Data Scientists can use R from Spark with SingleStore DB.

Explaining How Kafka Works With Robin Moffatt

In this episode of Cocktails, we talk to a senior developer advocate from Confluent about Apache Kafka, the advantages that Kafka’s distributed pub-sub model offers, how an event processing model for integration can address the issues associated with traditional static datastores, and the future of the event streaming space.

EPISODE OUTLINE

  • Robin Moffatt tells us what he does as a Senior Developer Advocate at Confluent.
  • We learn what Kafka offers as a distributed pub-sub model and its advantages over the traditional request-response model (a minimal sketch of the pattern follows this list).
  • We find out ksqlDB’s role as a database on top of Kafka and what sets it apart from other databases. 
  • How do you facilitate a microservices architecture via event-processing?
  • What does the future look like for event-streaming?
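The sketch below is not from the episode; it is a minimal illustration of the pub-sub pattern with the kafka-python client, assuming a broker at localhost:9092 and a hypothetical orders topic:

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer: publish an event to a topic. Any number of consumer groups can
# read it independently, unlike a point-to-point request-response call.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"order-42", value=b'{"status": "created"}')
producer.flush()

# Consumer: subscribe to the topic and replay events from the beginning.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
```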

How Is Data Processed in a Vector Database?

In the previous two posts in this blog series, we have already covered the system architecture of Milvus, the world's most advanced vector database, and its Python SDK and API.

This post mainly aims to help you understand how data is processed in Milvus by going deep into the Milvus system and examining the interaction between the data processing components.

Kafka Event Exchange Between Local and Azure

While it may not be a daunting task to set up Kafka on a local machine or within a particular network and produce/consume messages, people do face challenges when they try to make it work across networks.

Let’s consider a hybrid scenario where your software solution is distributed across two different platforms (say AWS and Azure, or on-premise and Azure), and there is a need to route messages from a Kafka cluster hosted on one platform to one hosted on another. This could be a valid business scenario wherein you are trying to consolidate your solution on one cloud platform and, in the interim, need this routing in place until you complete your migration. Even in the long term, there may be a need to maintain the solution across multiple platforms for various business and technical reasons.
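As one hedged way to picture such routing, the sketch below uses the kafka-python client to consume from a hypothetical on-premise cluster and republish to an Azure-hosted Kafka endpoint; the broker addresses, topic name, and SASL settings are placeholders, and a production setup would more likely rely on a dedicated replication tool such as MirrorMaker:

```python
from kafka import KafkaConsumer, KafkaProducer

# Consume from the on-premise cluster...
source = KafkaConsumer(
    "payments",
    bootstrap_servers="onprem-broker:9092",
    group_id="azure-bridge",
    auto_offset_reset="earliest",
)

# ...and republish to the Azure-hosted Kafka endpoint, typically over
# TLS with SASL authentication (placeholder credentials below).
target = KafkaProducer(
    bootstrap_servers="azure-broker:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="bridge-user",        # placeholder
    sasl_plain_password="<secret>",           # placeholder
)

for msg in source:
    target.send("payments", key=msg.key, value=msg.value)
```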

How Bitset Enables the Versatility of Vector Search

The release of Milvus 2.0 brings a variety of essential new vector database features. Among them, Time Travel, attribute filtering, and delete operations are related, as all three are built on one common mechanism: the bitset.

Therefore, this article aims to clarify the concept of bitset in Milvus and explain how it works to support delete operations, Time Travel, and attribute filtering with three examples.
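As a conceptual illustration only (simplified from how Milvus actually packs and applies bitsets), the sketch below shows how three bit masks can be combined so that deleted rows, rows newer than a Time Travel timestamp, and rows failing an attribute filter are all excluded from the search in one step:

```python
import numpy as np

# Tiny segment of 8 vectors; each mask marks rows that must be skipped.
deleted  = np.array([0, 0, 1, 0, 0, 0, 1, 0], dtype=bool)  # removed by delete operations
too_new  = np.array([0, 0, 0, 0, 1, 0, 0, 0], dtype=bool)  # inserted after the Time Travel timestamp
filtered = np.array([1, 0, 0, 0, 0, 1, 0, 0], dtype=bool)  # failing the attribute filter (e.g., brand != "levis")

# A row participates in the search only if no bit flags it out. Real engines
# keep these as packed bitsets and combine them per segment before searching.
searchable = ~(deleted | too_new | filtered)
print(np.flatnonzero(searchable))  # indices the ANN search is allowed to visit
```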

2 Billion MySQL Records

Yesterday Gary Gray, a friend of mine, sent me the following screenshot. He's got a database table with 2 billion records he intends to use Magic on, and he wanted to show it to me, thinking I'd probably be like, "That's cool." Gary works for Dialtone in South Africa and is used to handling a "monster amount of data." For the record, neither Gary nor I would want to encourage anyone to handle 2 billion records in MySQL, but as you can obviously see, it is possible.

This, of course, is a legacy system. If it were started today, Gary would obviously use ScyllaDB, Cassandra, or something similar. Simply counting the records in the table above requires 5 minutes of execution in MySQL Workbench. Obviously, such sizes are not for the faint of heart. Such a database also implies a lot of inserts, which makes it impossible to add indexes, so every select query you run against it, whatever its where clause, implies a full table scan. However, it is possible.
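As a hedged illustration of working with a table at this scale (with hypothetical connection details and a hypothetical records table), the sketch below pulls an instant approximate row count from information_schema instead of a minutes-long COUNT(*), and uses EXPLAIN to check whether a filtered query will trigger a full table scan:

```python
import mysql.connector

# Hypothetical connection details, schema, and table name.
conn = mysql.connector.connect(
    host="localhost", user="gary", password="secret", database="dialtone"
)
cur = conn.cursor()

# An exact COUNT(*) walks the whole table (minutes at 2 billion rows); the
# statistics in information_schema return a rough estimate instantly.
cur.execute("""
    SELECT table_rows
    FROM information_schema.tables
    WHERE table_schema = 'dialtone' AND table_name = 'records'
""")
print("approximate rows:", cur.fetchone()[0])

# EXPLAIN shows whether the filter can use an index or must scan everything
# ("ALL" in the type column of the output means a full table scan).
cur.execute("EXPLAIN SELECT * FROM records WHERE created_at > '2022-01-01'")
for row in cur.fetchall():
    print(row)
```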