E-Handbook: Data lake concept needs firm hand to pay big data dividends Article 2 of 3

7 steps to a successful data lake implementation

Flooding a Hadoop cluster with data that isn't well organized and managed can stymie analytics efforts. Take these steps to help make your data lake accessible and usable.

By David Loshin, Knowledge Integrity Inc.

Published: 08 Oct 2019

The concept of the data lake originated with big data's emergence as a core asset for companies and Hadoop's arrival as a platform for storing and managing the data. However, blindly plunging into a Hadoop data lake implementation won't necessarily bring your organization into the big data age -- at least, not in a successful way.

That's particularly true in cases where data assets of all shapes and sizes are funneled into a Hadoop environment or another big data repository in an ungoverned manner. A haphazard approach of this sort leads to several challenges and problems that can severely hamper the use of a data lake to support big data analytics applications.

For example, you might not be able to document what data objects are stored in a data lake or their sources and provenance. That makes it difficult for data scientists and other analysts to find relevant data distributed across a Hadoop cluster and for data managers to track who accesses particular data sets and determine what level of access privileges are needed on them.

Organizing data and "bucketing" similar data objects together to help ease access and analysis is also challenging if you don't have a well-managed process.

None of these issues have to do with the physical architecture of the data lake or the underlying environment, whether that's the Hadoop Distributed File System or a cloud object store like Amazon Simple Storage Service -- or a combination of those technologies, each containing different types of data. Rather, the biggest impediments to a successful data lake implementation result from inadequate planning and oversight on managing data.

Do what needs doing with Hadoop data

The good news, however, is that these challenges are easily overcome. Here are seven steps to address and avoid them:

1. Create a taxonomy of data classifications. Organizing data objects in a data lake depends on how they're classified. Identify the key dimensions of the data as part of your classifications, such as data type, content, usage scenarios, groups of possible users and data sensitivity. Data sensitivity relates to protecting both personal and corporate data -- such as personally identifiable information on customers in the first case and intellectual property in the second.
2. Design a proper data architecture. Apply the defined classification taxonomy to direct how the data is organized in your Hadoop environment. The resulting plan should include things like file hierarchy structures for data storage, file and folder naming conventions, access methods and controls for different data sets, and mechanisms for guiding data distribution.
3. Employ data profiling tools. In many cases, the lack of knowledge about all the data going into a data lake can be partially alleviated by analyzing its content. Data profiling tools can help by gathering information about what's in data objects, thereby providing insight for classifying them. Profiling data as part of a data lake implementation also helps identify data quality issues that should be assessed for possible fixes, so that data scientists and other analysts work with accurate information.
4. Standardize the data access process. Difficulties in effectively using data sets stored in a Hadoop data lake often stem from different analytics teams using a variety of data access methods, many of them undocumented. Instituting a common and straightforward API instead can simplify data access and ultimately allow more users to take advantage of the data.
5. Develop a searchable data catalog. A more insidious obstacle to effective data access and usage is prospective users being unaware of what's in a data lake, where different data sets are located in the Hadoop environment, and what their lineage, quality and currency are. A collaborative data catalog allows these -- and other -- details about each data asset to be documented. For example, it captures structural and semantic metadata, provenance and lineage records, information on access privileges and more. A data catalog also provides a forum for groups of users to share experiences, issues and advice on working with the data.
6. Implement sufficient data protections. Aside from the conventional aspects of IT security, such as network-perimeter defenses and role-based access controls, use other methods to prevent the exposure of sensitive information contained in a data lake. That includes mechanisms like data encryption and data masking, along with automated monitoring to generate alerts about unauthorized data access or transfers.
7. Raise data awareness internally. Finally, make sure that the users of your data lake are aware of the need to actively manage and govern the data assets it contains. Train them on how to use the data catalog to find available data sets and how to configure analytics applications to access the data they need. At the same time, impress upon them the importance of proper data usage and strong data quality.
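
To make the first two steps more concrete, here is a minimal Python sketch of how a classification taxonomy might drive a folder naming convention. It is an illustration only, not a prescribed layout: the class names, the sensitivity levels, the /datalake root and the ordering of path segments are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    """Hypothetical sensitivity levels; adapt these to your own governance policy."""
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"  # personally identifiable information
    IP = "ip"    # corporate intellectual property


@dataclass(frozen=True)
class DataClassification:
    """One record per data set, covering the dimensions named in step 1."""
    data_type: str    # e.g. "csv", "parquet", "json"
    content: str      # subject area, e.g. "Customer Orders"
    usage: str        # intended usage scenario
    user_group: str   # group of likely users
    sensitivity: Sensitivity


def slug(text: str) -> str:
    """Lowercase the text and join its words with hyphens for use in a path."""
    return "-".join(text.lower().split())


def storage_path(c: DataClassification, dataset: str, root: str = "/datalake") -> str:
    """Derive a folder path from the classification (an assumed convention)."""
    return "/".join([
        root,
        c.sensitivity.value,   # top level, so access controls can be set per branch
        slug(c.user_group),
        slug(c.content),
        c.data_type,
        slug(dataset),
    ])


orders = DataClassification(
    data_type="parquet",
    content="Customer Orders",
    usage="churn analysis",
    user_group="Marketing Analytics",
    sensitivity=Sensitivity.PII,
)
print(storage_path(orders, "Orders 2019"))
```

Placing sensitivity at the top of the hierarchy is one possible design choice: it lets access controls be applied to a whole branch of the file system rather than to individual data sets.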
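
The data profiling step can also be illustrated in a few lines. The sketch below is a deliberately simple stand-in for a real profiling tool, using only the Python standard library: for each column of a CSV extract it counts rows, missing values and distinct values, and guesses whether the column is numeric. The function names and the sample data are hypothetical.

```python
import csv
import io


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, inferred type."""
    present = [v for v in values if v is not None and v.strip() != ""]

    def is_number(text):
        # Treat anything Python can parse as a float as numeric.
        try:
            float(text)
            return True
        except ValueError:
            return False

    if not present:
        inferred = "empty"
    elif all(is_number(v) for v in present):
        inferred = "numeric"
    else:
        inferred = "text"

    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "inferred_type": inferred,
    }


def profile_csv(text):
    """Profile every column of a CSV document supplied as a string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return {}
    return {name: profile_column([row[name] for row in rows]) for name in rows[0]}


# Made-up sample: one missing email and one non-numeric age value.
SAMPLE = "customer_id,email,age\n1,a@example.com,34\n2,,41\n3,c@example.com,n/a\n"

for column, stats in profile_csv(SAMPLE).items():
    print(column, stats)
```

In this sample, the age column is reported as text because of the "n/a" entry, which is the kind of data quality issue profiling is meant to surface before analysts start using the data.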
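
Finally, the searchable data catalog can be pictured as a small data structure. The following sketch is an in-memory toy rather than a real catalog product: the entry fields, role names and sample data sets are assumptions chosen to mirror the metadata mentioned above (location, provenance and access privileges).

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset (fields are illustrative, not a standard)."""
    name: str
    location: str      # where the data set lives in the lake
    description: str   # semantic metadata in plain language
    source: str        # provenance: the system the data came from
    allowed_roles: set = field(default_factory=set)
    tags: set = field(default_factory=set)


class DataCatalog:
    """A toy in-memory catalog supporting registration, keyword search and access checks."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return the names of entries whose name, description or tags contain the keyword."""
        needle = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in tag.lower() for tag in e.tags)
        )

    def can_read(self, name, role):
        """Check whether a role is listed as allowed to read the named data set."""
        return role in self._entries[name].allowed_roles


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customer_orders",
    location="/datalake/pii/marketing-analytics/customer-orders",
    description="Order history per customer",
    source="CRM export",
    allowed_roles={"analyst"},
    tags={"sales"},
))
catalog.register(CatalogEntry(
    name="web_logs",
    location="/datalake/internal/web-analytics/web-logs",
    description="Raw clickstream events",
    source="web servers",
    allowed_roles={"analyst", "engineer"},
    tags={"clickstream"},
))

print(catalog.search("customer"))
print(catalog.can_read("customer_orders", "engineer"))
```

A production catalog would add lineage records, quality indicators and user comments, but even this small structure shows how documenting location and access privileges in one place lets users find data without knowing the underlying file layout.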

To meet the ultimate objective of making a data lake accessible and usable, it's crucial to have a well-designed plan for dealing with the data prior to migrating it into your Hadoop environment or cloud-based big data architecture. Taking the steps outlined here will help streamline the data lake implementation process. More important, the right combination of planning, organization and governance will help maximize your organization's investment in a data lake and reduce the risk of a failed deployment.
