The realities of enterprise data lakes: The hype is over

Joy King

Vertica, a Micro Focus Product Group

For the last decade, we have watched an interest grow into an obsession: grab the data, store the data, keep the data. The software industry saw an opportunity to capitalize on this obsession, leading to an explosion of open source big data technologies, like Hadoop, as well as proprietary storage platforms advertising their value as “data lakes,” “enterprise data hubs” and more. In a growing number of industries, the goal has been achieved: Organizations hold as much data as possible and keep it for as long as possible.

Data is the new oil, but mining for value requires lots of pipes

Now comes the next phase of any hype cycle: reality. Data is indeed the new oil or the new gas, but none of this matters if value cannot be mined from the data. The oil and gas industry has an advantage. In each identified location, an oil well is created by drilling a long hole into the earth and a single steel pipe (casing) is placed in the hole, allowing the oil to be extracted. When the oil is extracted, it is processed and then brought to market. No integration with other oil repositories is necessary. Unfortunately, this is not the case when drilling for business value in individual data lakes and data hubs.

The manufacturing industry, and specifically the manufacturing plant, offers one of the most complex examples of data that is collected but delivers limited value. Each plant collects its own data and, in some cases, stores that data in public or private clouds. The plant can (sometimes) use that data to optimize its own environment, understand what is happening and maybe even predict what is going to happen. But what about the trends, the insights and the continuous improvement practices that could benefit the many widely distributed manufacturing plants of a large enterprise? What about the optimization between manufacturing, inventory management, supply chain and distribution? All of these groups have their own data, but a single pipe can’t reach all of it.

Beware of centralizing the data

So, what is the solution? Some would say that it’s critical to centralize the data, to ensure that it is co-located in a single public cloud object store or a centralized data warehouse. But the 1980s are well behind us. Nevertheless, this approach is gaining attention in the market from some of the cloud vendors and cloud-specific systems. They have a strong motivation to reach for the data because of a newer and very dangerous term: data egress. Getting data into a central location is not easy, but it is doable. Getting data out of a single cloud or solution provider is very difficult and expensive because once the data is within a single environment, the vendor has control. Distributed data is the reality we have to address, and it requires a completely different approach: bringing the analytics to the data, wherever it resides and in whatever format it needs to be in, while ensuring that this does not result in a tangled mess of pipes.
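
To make the idea of bringing the analytics to the data concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: the names `PartialStats`, `summarize_locally` and `combine`, along with the plant names and readings, are hypothetical. Each location reduces its own data to a small summary, and only those summaries travel to a coordinator.

```python
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Small summary that leaves a site instead of the raw rows (hypothetical type)."""
    count: int
    total: float


def summarize_locally(readings):
    """Runs where the data lives and returns only a compact summary."""
    return PartialStats(count=len(readings), total=float(sum(readings)))


def combine(partials):
    """Runs at the coordinator and merges site summaries into a global mean."""
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return total / count if count else None


# Hypothetical per-plant sensor readings; in practice these would stay on site.
plants = {
    "plant_a": [70.0, 72.0, 74.0],
    "plant_b": [80.0, 82.0],
}

# Only one PartialStats object per plant crosses the network, not the raw readings.
partials = [summarize_locally(rows) for rows in plants.values()]
global_mean = combine(partials)
```

The enterprise-wide average here is the same as if every reading had been copied to one place first, but the raw data never moves. Not every analysis decomposes this neatly, which is part of why the choice of platform matters.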

Deriving business value with analytics

Successful industry disruptors focus on the business value derived from the analytical insights in their data, not simply on data collection. They each start with an end goal in mind and achieve it with a unified analytics platform that respects the data format and the data location, and that applies a consistent and advanced set of analytical functions without demanding unnecessary, expensive and time-consuming data movement. A unified analytics platform is also open to integration within a broader ecosystem of applications, ETL tools, open source innovation and, perhaps most importantly, security and encryption technologies. On top of it all, a unified analytics platform delivers the performance needed for the scale of data that is the new normal in today’s world.

The hype cycle of data lakes is over, and the risk of data swamps is real. Combined with the confusion and uncertainty regarding the future of Hadoop, this means the time to architect, or rearchitect, is now. And it’s imperative to start with the right end goal in mind: how to mine the data in a unified, protected and location-independent way without creating delays that undermine the business outcome.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.