Here’s How You Can Purge Big Data From Unstructured Data Lakes

Without a doubt, big data keeps getting bigger with the passage of time. Here is some evidence to back that claim up.

According to the big data and business analytics report from Statista, global cloud data IP traffic will reach approximately 19.5 zettabytes in 2021. Moreover, the big data market will reach 274.3 billion US dollars by 2022, growing at a five-year compound annual growth rate (CAGR) of 13.2%. Forbes predicts that over 150 trillion gigabytes (150 zettabytes) of real-time data will be required by the year 2025. Forbes also found that more than 95% of companies need some assistance managing unstructured data, while 40% of organizations said they need to deal with big data more frequently.

Make Analytical Data Available to Everyone by Overcoming the 5 Biggest Challenges [Webinar]

"Data and analytics for all!" — the admirable new mantra for today's companies. But it's not easy to put all of an organization's analytical data and assets into the hands of everyone who needs them. That's why embarking on this democratization initiative requires you to be prepared to overcome the five monumental challenges you will undoubtedly face.

Join us for this interactive webcast where we will: explore the recommended components of an all-encompassing, extended analytics architecture; dive into the details of what stands between you and data democratization success; and reveal how a new open data architecture maximizes data access with minimal data movement and no data copies.

Data Lakes Are Not Just For Big Data

We recently wrote an article debunking common myths about data lake architectures, data lake definitions, and data lake analytics. It is called "What is a Data Lake? Get A Leg Up Avoiding The Biggest Myths." In that article, we framed the current conversation about data lakes and how they fit within enterprise data strategies. This topic has historically been confusing and opaque for those wanting to get value from a data lake due to conflicting advice from consultants and vendors.  

One area that can be particularly confusing is the perception that lakes are only for "big data." If you spend any time reading materials on lakes, you would think there is only one type and that it looks like the Caspian Sea (which is a lake despite the "sea" in its name). People describe data lakes as massive, all-encompassing entities, designed to hold all knowledge. The good news is that lakes are not just for "big data," and you have more opportunities than ever to make them part of your data stack.

Intelligent Big Data Lake Governance

When data flows into the ecosystem fast and with great variety, the biggest challenge is governing that data. In a traditional data warehouse, where data is structured and the structure is always known in advance, creating processes, methods, and frameworks is quite easy. But in a big data environment, where data flows in fast and the schema is inferred at run time, the data must be governed at run time as well.
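To make that contrast concrete, here is a minimal sketch, assuming a PySpark environment and a hypothetical semi-structured feed at "data/events.json", of how a big data pipeline typically discovers structure only at read time rather than having it declared up front as in a warehouse:

```python
from pyspark.sql import SparkSession

# Assumption: a local Spark session and a JSON feed whose schema is not
# known ahead of time (hypothetical path "data/events.json").
spark = SparkSession.builder.appName("runtime-schema-demo").getOrCreate()

# Spark infers the schema while reading -- the structure is only known
# at run time, which is exactly when governance rules must be applied.
df = spark.read.json("data/events.json")

# Inspect what was inferred; in a warehouse this would have been a fixed DDL.
df.printSchema()
print(df.schema.json())  # serialized schema, useful for change detection later
```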

When I was working with my team to develop an ingestion pipeline and collecting ideas from the team and other stakeholders on how it should work, one requirement kept coming up: could we build a system that analyzes what changed overnight in a feed's structure? The second requirement was detecting the pattern of the data, e.g., how could we determine that a data element is a Social Security number (SSN), a first name, and so on, so that we can tag sensitive information at run time?
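As a rough illustration of both requirements, the sketch below compares yesterday's and today's captured feed structures and applies simple regex heuristics to flag likely sensitive columns. The column names, patterns, and the detect_changes/tag_sensitive helpers are hypothetical examples, not part of any specific product:

```python
import re

# Assumption: each feed's structure is represented as {column_name: data_type},
# captured at run time (e.g. from an inferred Spark schema).
yesterday = {"id": "bigint", "first_name": "string", "ssn": "string"}
today = {"id": "bigint", "first_name": "string", "ssn": "string", "email": "string"}

def detect_changes(old_schema, new_schema):
    """Report columns that were added, removed, or changed type overnight."""
    added = {c: t for c, t in new_schema.items() if c not in old_schema}
    removed = {c: t for c, t in old_schema.items() if c not in new_schema}
    retyped = {c: (old_schema[c], t) for c, t in new_schema.items()
               if c in old_schema and old_schema[c] != t}
    return added, removed, retyped

# Simple pattern heuristics for tagging sensitive values at run time.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def tag_sensitive(column_name, sample_values):
    """Return the tags whose pattern matches every sampled value."""
    return [tag for tag, pattern in SENSITIVE_PATTERNS.items()
            if sample_values and all(pattern.match(v) for v in sample_values)]

print(detect_changes(yesterday, today))        # ({'email': 'string'}, {}, {})
print(tag_sensitive("ssn", ["123-45-6789"]))   # ['ssn']
```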

An Introduction to Data Virtualization and Its Use Cases

Data virtualization is a solution to address several issues. This type of solution is booming, with strong year-over-year growth. But let's start with a definition first.

What Is It?

Data virtualization is the process of inserting a layer of data access between data sources and data consumers to facilitate access. In practice, the tool is a kind of SQL requester that can query very heterogeneous data sources, ranging from traditional SQL databases to text or PDF files, or a streaming source like Kafka. In short, you have data, you can query it, and you can join it across sources. You can thus offer a unified and complete view of the data, even if it is "exploded" across several systems. On top of that, a cache and a query optimizer minimize the performance impact on source systems. And, of course, a data catalog helps you find your way through all the data in your IT infrastructure. From this we can deduce two main use cases.
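To illustrate the idea only (not any particular data virtualization product), here is a toy sketch that exposes a relational table and a flat file through one access point and joins them into a unified view. The table, file path, and columns are made up for the example:

```python
import sqlite3
import pandas as pd

# Assumption: one "traditional" SQL source (an in-memory SQLite database)
# and one file-based source (a hypothetical customers.csv) standing in for
# the heterogeneous systems a virtualization layer would federate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.5), (2, 42.0)])
conn.commit()

orders = pd.read_sql("SELECT customer_id, amount FROM orders", conn)
customers = pd.read_csv("customers.csv")  # expected columns: customer_id, name

# The "unified view": a join across two systems exposed as one result set.
unified = orders.merge(customers, on="customer_id")
print(unified)
```

A real virtualization layer adds what this toy lacks: a shared SQL dialect over all sources, caching, query pushdown, and a catalog, so consumers never need to know where each table physically lives.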