Data Lives Longer Than Any Environment, So Data Management Needs to Extend Beyond the Environment

Komprise enables enterprises to analyze, mobilize, and monetize file and object data across clouds, data centers, and the edge. The solution constantly monitors key business services, identifies changes in usage patterns, and automatically captures new insights. Komprise also simplifies access to all enterprise data, helping companies make better decisions faster while driving increased revenue from existing infrastructure.

The 41st IT Press Tour had the opportunity to meet with Kumar Goswami, Co-Founder and CEO; Darren Cunningham, VP of Marketing; Ben Conneely, VP of EMEA Sales; and Krishna Subramanian, Co-Founder and COO of Komprise.

GitHub Is Bad for AI: Solving the ML Reproducibility Crisis

There is a crisis in machine learning that is preventing the field from progressing as fast as it could. It stems from a broader predicament surrounding reproducibility that impacts scientific research in general. A Nature survey of 1,500 scientists revealed that 70% of researchers have tried and failed to reproduce another scientist’s experiments, and over 50% have failed to reproduce their own work. Reproducibility, also called replicability, is a core principle of the scientific method and helps ensure the results of a given study aren’t a one-off occurrence but instead represent a replicable observation.

In computer science, reproducibility has a narrower definition: any results should be documented by making all data and code available so that the computations can be executed again with the same results. Unfortunately, artificial intelligence (AI) and machine learning (ML) are off to a rocky start when it comes to transparency and reproducibility. For example, take this response published in Nature by 31 scientists who are highly critical of a study from Google Health that documented successful trials of an AI system that detects signs of breast cancer.
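
By that definition, re-running the same code on the same data should yield the same numbers. Here is a minimal sketch of what that looks like in practice, assuming a Python workflow with NumPy; the seeding and manifest conventions are illustrative and not taken from the Google Health study or the Nature response.

```python
import json
import platform
import random

import numpy as np

SEED = 42  # fixed seed so repeated runs produce identical "random" draws


def set_seeds(seed: int = SEED) -> None:
    """Seed every source of randomness the experiment uses."""
    random.seed(seed)
    np.random.seed(seed)


def record_environment(path: str = "run_manifest.json") -> None:
    """Write down the details another researcher needs to re-run the experiment."""
    manifest = {
        "seed": SEED,
        "python": platform.python_version(),
        "numpy": np.__version__,
    }
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2)


if __name__ == "__main__":
    set_seeds()
    sample = np.random.normal(size=5)  # identical on every run with the same seed
    record_environment()
    print(sample)
```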

Here’s How You Can Purge Big Data From Unstructured Data Lakes

Without a doubt, big data keeps getting bigger with the passage of time. Here is some of the evidence behind that claim.

According to the big data and business analytics report from Statista, global cloud data IP traffic will reach approximately 19.5 zettabytes in 2021, and the big data market will reach 274.3 billion US dollars by 2022, a five-year compound annual growth rate (CAGR) of 13.2%. Forbes has predicted that over 150 trillion gigabytes (150 zettabytes) of real-time data will be needed by 2025. Forbes also found that more than 95% of companies need some help managing unstructured data, while 40% of organizations said they need to deal with big data more frequently.

The Future of AI in Insurance

Artificial intelligence (AI) and machine learning (ML) have come a long way, both in terms of adoption across the broader technology landscape and within the insurance industry specifically. That said, there is still much more territory to cover in helping key employees like claims adjusters do their jobs better, faster, and more easily.

Data science is currently being used to uncover insights that claims representatives say wouldn't be available otherwise and that can be extremely valuable. Data science steps in to identify patterns within massive amounts of data that are too large for humans to comprehend on their own; machines can alert users to relevant, actionable insights that improve claim outcomes and facilitate operational efficiency.

What CDOs and CAOs Struggle With Most

Our team recently attended the Chief Data & Analytics Officers (CDAO) conference in Boston and used the opportunity to conduct an informal poll. The conference was packed with C-suite executives trying to wrangle big data at companies like Tesla, Lionsgate, AMD, Capital One, and Ford. We asked everyone about their analytics challenges, and two standout issues kept coming up again and again.

1. Their data scientists get bogged down with data access challenges

A recent study showed that data preparation and data engineering tasks represent over 80% of the time consumed in most AI and machine learning projects.

How to Handle the Influx of Data

To learn about the current and future state of databases, we spoke with and received insights from 19 IT professionals. We asked, "How can companies get a handle on the vast amounts of data they’re collecting?" Here’s what they shared with us:

Ingest

  • It’s incredibly important to ingest data, store it, and present it for querying. We have a lambda architecture for in-memory processing, streaming, and analytics, with very scalable data at rest for historical data. When people struggle, it’s usually because they’ve only figured out one piece of the puzzle. They may be able to ingest data quickly, but they are not able to analyze the data and get insights. It’s all about being able to capture the data and then do valuable things with it at the same time.
  • Have an Agile data architecture. We have perfected the collection of data with data ingestion solutions like Spark and Kinesis. But there are still a lot of challenges remaining in analyzing and operationalizing the data. There is not enough scale and investment going on in those two areas. Focus on concepts like federated query. Data can reside anywhere. Optimize compute to understand where the data lives so you can produce fast results. Data labs give people their own sandbox to work on data that exists and bring compute to where the data resides.
  • We handle data at a high level with governance based on where data is coming from, its structure, and where it’s going. With things like GDPR, this has become more important. Ingesting data streaming in real time is key; stream-based ingestion is increasing in both volume and noise. Bring in other technologies like Kafka to ingest. Multiple platforms offer “horses for courses.” (See the ingestion sketch after this list.)
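
As an illustration of the stream-based ingestion with Kafka mentioned in the last point, here is a minimal consumer sketch in Python using the kafka-python client; the broker address, topic name, and schema check are placeholders rather than details from any contributor.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder broker and topic names -- adjust for your environment.
consumer = KafkaConsumer(
    "file-activity-events",
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Tag each record with where it came from and apply a minimal structure
    # check before handing it to downstream storage or analytics.
    if "user_id" not in event:
        continue  # drop records that fail the schema check
    print(f"partition={message.partition} offset={message.offset} user={event['user_id']}")
```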

Query

  • The data management problem is solved with an overarching data management solution. Consider what data needs to be stored, for how long, and at what granularity. For example, in banking with mobile access, a lot of customers look at their balances when they are bored. Because we’re in-memory, we can cache balance information so it’s cheap and easy for customers to get to. (See the caching sketch after this list.)
  • Be able to securely store large amounts of data. Companies are using the cloud to do this because they do not have to pre-provision resources. They typically store this data in object stores like Amazon S3 or Google Cloud Storage. The second challenge is to derive value from these data sets; much of the value stays inaccessible because there is no way to query the raw data. A developer has to massage the data using various data pipelines before they can unlock its value, and this transformation typically uses its own custom APIs. Databases make it easy to query these data sets. Databases associate a schema with the data, either at read time or write time, and make it accessible to a developer via a very standard query language like SQL. New-age databases can continuously ingest data from cloud services, like Amazon S3, Google Cloud, or DynamoDB, and make it queryable via standard SQL. This makes it easier for a developer to extract value from large sets of data. (See the schema-on-read sketch after this list.)
  • 1) Auditing is probably the first step. Understand what the data is, its origin, and its destination. Then marry this with the overall strategy of the business and figure out whether vital data exists, whether it should be archived, or whether it needs enrichment to produce meaningful data. 2) In a previous life, the first task was to run tools that would scan the network and find instances of running databases. In some cases, customers had several copies of the same data being processed by different systems, costing vast amounts in infrastructure and resources, and no one was using this data. This goes back to designing databases with a purpose in mind. 3) Stream processing can play a huge part. By validating, classifying, and enriching data, you can add context and meaning. That way you can determine how much value it may have to you. Stream processing enables organization and context, which in turn enables understanding.
  • An active analytics platform enables clients to handle data and to access streaming and historical data using SQL queries. We are now able to include graph relationship queries, and we also see the opportunity to run trained ML algorithms against the active analytics database.
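
The in-memory caching of balance information described in the first bullet can be sketched as a cache-aside pattern. Redis via redis-py is used here purely as a stand-in; the contributor does not name a specific product, and the account lookup and values are invented for illustration.

```python
import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def fetch_balance_from_core_banking(account_id: str) -> str:
    """Stand-in for the slow, authoritative balance lookup."""
    return "1234.56"


def get_balance(account_id: str) -> str:
    key = f"balance:{account_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cheap, in-memory hit
    balance = fetch_balance_from_core_banking(account_id)
    cache.setex(key, 60, balance)          # keep the cached value for 60 seconds
    return balance


print(get_balance("acct-001"))
```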
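
The schema-on-read idea from the second bullet, associating a schema at read time and querying raw object-store data with standard SQL, might look like the following PySpark sketch. The bucket path and field names are hypothetical, and the cluster is assumed to be configured with S3 credentials.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-raw-object-store").getOrCreate()

# Schema is inferred at read time (schema-on-read); the path is a placeholder.
events = spark.read.json("s3a://example-bucket/raw/events/")
events.createOrReplaceTempView("events")

# Standard SQL over raw files, with no upfront ETL pipeline required.
daily_counts = spark.sql("""
    SELECT to_date(event_time) AS day, COUNT(*) AS events
    FROM events
    GROUP BY to_date(event_time)
    ORDER BY day
""")

daily_counts.show()
```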

Other

  • Delete it as fast as you possibly can. The types of customers that can and cannot delete data vary by industry; healthcare, aerospace, and finance must preserve data. Are you going to archive? Do you put it in a warehouse? Is the database transactional? How up to date does the data need to be: real time or near-real time? Balance a transactional system at run time against the analytics customers want to run. RDBMS or ELK stack? A database is a tool; don’t abuse it. Have a strategy around the data and the long-term and short-term problems to address. Get it right early or it just gets more difficult. (See the retention sketch after this list.)
  • Be more intelligent about how you will use the data to do novel things. Accelerate database releases to provide knowledge to the business more quickly. Be smart about equipping the right individuals to have control over their destiny. People are moving away from the monolith. Choose the right technology based on what you are trying to achieve. There are more tools today with greater specialization. Let teams chase after and test different solutions so they benefit from processing all of this data.
  • It’s a challenging task to get a handle on data collection, but it’s even more challenging to provide data access. Database technologies, such as data indexing, data normalizing, and data warehousing, allow companies to systematically store and retrieve data as efficiently and effectively as possible.
  • If you collect meaningful data that you expect to be able to sort, categorize, and report from, it should be stored in a database! And your database strategy will be key to your operational efficiency.
  • Databases are part of the solution. Choose a data storage product based on how to get the data in and how to query it. In terms of value, it comes down to how much you need to scale out to avoid performance hits. There’s vertical and horizontal scaling; traditional databases scale well vertically, and horizontally scaling databases are now maturing as well. Cost is an issue. If you host in a public cloud, a lot of licensing headaches are removed because the cloud vendor has worked out the details. It’s much easier to adopt a database service because you don’t have to provision hardware.
  • Traceability, lineage, and governance are key. The graph model is able to represent open-ended, complex pictures using nodes and relationships. Keep track not only of metadata lineage but of all the different identifiers for an individual, their devices, and their identities. We are seeing the rise of the chief data officer and of governance with GDPR and the California initiative. It’s not unlike a data warehouse, where you get the data you need based on the requirements you have. See how pieces of data correlate across the entire enterprise: what kinds of data do you want to see correlated, and what kinds of relationships do you want to discover? (See the lineage-graph sketch after this list.)
  • Many companies need a better and more accurate understanding of how they expect their data to scale and what the projected growth rate is going to be. Granted, it can be hard to get perfectly right. (You might start with a small environment, get customers faster than anticipated, and blow out projections.) But the need to understand, from the beginning, what data you are collecting and what the volume is going to be cannot be overstressed. With Apache Cassandra, for example, it’s fairly easy to scale, but it’s not particularly fast to do so. You need to plan deployment with enough runway; if you hit limits, you’re going to have problems.
  • Although we are built to handle and scale high volumes of data, one of the first steps is always to get a clear picture of which data points are really important. The value of data is also changing with the location (e.g. cloud vs. edge) and over time. Exploring and learning from (and with) the data is an important part of ongoing success.
  • Use data platforms that allow you to work naturally with data of any shape or structure in its native form without having to constantly wrangle with a rigid schema. You also need the ability to scale out on commodity infrastructure, and to do it across geographic regions, to accommodate massive increases in data volume.
  • That’s why we developed a platform to handle the scale and diversity of data. Edge to cloud is a common use case, with initial processing at the edge and then moving the data to data centers. Once the data is in a central location, that’s where you can do ML, come up with models, and push insight back to the edge. When you have datasets like that, that’s where the database and streaming fit in; with fast streaming and fast processing, you need a platform with different data services to meet all of your needs.
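
As a sketch of the retention decisions raised in the first bullet, one concrete mechanism is a write-time TTL. The example below uses Apache Cassandra (mentioned later in the list) via the Python cassandra-driver, with a hypothetical keyspace, table, and 30-day expiry.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS telemetry.readings (
        device_id text, ts timestamp, value double,
        PRIMARY KEY (device_id, ts)
    )
""")

# Each row expires automatically after 30 days (TTL is in seconds),
# enforcing the short-term retention policy at write time.
session.execute(
    "INSERT INTO telemetry.readings (device_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s) USING TTL 2592000",
    ("sensor-42", 21.5),
)

cluster.shutdown()
```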
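
The node-and-relationship lineage model described above can be pictured with a tiny directed graph. networkx stands in here for a real graph database, and all of the dataset and identity names are made up.

```python
import networkx as nx  # pip install networkx

lineage = nx.DiGraph()

# Nodes are datasets and identities; edges capture relationships such as
# "derived from", "belongs to", and "feeds".
lineage.add_edge("crm.customers", "warehouse.dim_customer", relation="derived_from")
lineage.add_edge("clickstream.raw", "warehouse.fact_sessions", relation="derived_from")
lineage.add_edge("device:phone-123", "customer:alice", relation="belongs_to")
lineage.add_edge("warehouse.dim_customer", "report.churn_dashboard", relation="feeds")

# Trace everything downstream of a source table -- useful for GDPR-style
# questions such as "where does this customer data end up?"
print(sorted(nx.descendants(lineage, "crm.customers")))
```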


Seeing in the Dark: The Future of Automation With Unstructured Data

About 2.5 quintillion bytes of data are created every day, according to Forbes magazine. And about 90 percent of that is unstructured data: video, audio, image, email, instant messaging, and other types.

This "dark data" creates a major headache for organizations. 80 percent of business processes today rely heavily on people to locate, organize, and input unstructured data before the process can even begin.

A Data Wave in the Cloud: Reconsidering Cloud Security

When organizations originally started to move data to the cloud in a meaningful way, the security conversation usually centered around one tactic – access. After all, if you could ensure that only the right people had access to a particular cloud, your data would be safe, right? Not quite.

As we continue to see an increase in data breaches impacting data stored in the cloud, it’s clear that access, in and of itself, isn’t the silver bullet solution. If bad actors want to get your data, they will find a way — they study each new access control technology until they find its vulnerability. Continuing to simply apply another control that hackers will again unlock is a never-ending, no-win prospect — and not cost-effective. In fact, though organizations pour more money into data security, breaches continue to increase.

Unstructured Data Is an Oxymoron

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.
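
A minimal pandas sketch of that tabular shape, with made-up subjects and measurements, just to make the rows-and-columns idea concrete:

```python
import pandas as pd

# Each row is a subject; each column is a kind of measurement.
patients = pd.DataFrame(
    {
        "subject_id": ["p01", "p02", "p03"],
        "age": [34, 51, 42],
        "systolic_bp": [118, 135, 127],
    }
)

print(patients)
print(patients["systolic_bp"].mean())  # column-wise operations are straightforward
```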