E-Handbook: Enterprise data lakes hold the key to actionable insights Article 2 of 4

Feature

Key factors for successful data lake implementation

There are many important parts to a data lake implementation, from technology to governance. Read on for the top factors to evaluate in your implementation strategy.

By Chris Foot

Published: 06 Jul 2020

In addition to the business drivers behind the growth of data lakes, the cloud's ability to offer vast amounts of storage and processing power at ever-decreasing price points is making data lake platforms increasingly attractive to organizations of all sizes.

Data lake implementation continues to capture the attention of the IT community. A recent analysis report from Research and Markets forecasts that the data lake market will grow at a compound annual growth rate (CAGR) of 26%, reaching $20.1 billion by 2024.

If your organization is considering a data lake implementation, here are some things you should consider.

What is a data lake?

An easy way to define and better understand data lakes is to compare them to data warehouses. Although data warehouses and data lakes are both used to store large amounts of data, there are significant differences.

Organizations can use data lake information in many ways, and the data sources do not need a predefined purpose to qualify for ingestion into a data lake. Analysts explore, experiment and evaluate data lake information to identify its benefits and use cases. Meanwhile, data warehouses ingest and store data for a predetermined purpose.

Data warehouse specialists often perform a high level of analysis to evaluate and identify input sources. But the strategy for a data lake implementation is to ingest and analyze data from virtually any system that generates information.

Data warehouses use predefined schemas to ingest data. In a data lake, analysts apply schemas after the ingestion process is complete.

Data lakes store data in its raw form. As a result, data ingestion is a fairly uncomplicated process. In a data warehouse, data is heavily processed during ingestion to ensure it adheres to the schema and its predefined purpose.
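
The difference between applying a schema during ingestion and applying it afterward can be illustrated with a minimal sketch. It assumes raw records are stored as JSON strings; the record contents, the SALES_SCHEMA mapping and the read_with_schema helper are hypothetical names used only for illustration and do not come from any particular platform.

```python
import json

# Raw records land in the lake exactly as produced; nothing is validated on write.
RAW_EVENTS = [
    '{"user": "a17", "amount": "19.99", "ts": "2020-07-06T10:00:00"}',
    '{"user": "b42", "amount": 5, "channel": "web"}',
    '{"user": "c03"}',
]

# Hypothetical schema chosen later by an analyst: field -> (caster, default).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            # Cast present values; fall back to the default for missing ones.
            row[field] = cast(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
for row in rows:
    print(row)
```

The raw strings are never modified, so a different analyst could read the same records later with a different schema. A data warehouse would typically perform the casting and rejection of nonconforming records before the data is stored.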

Data lakes specialize in ingesting structured, semistructured and unstructured data. They also provide mechanisms to easily ingest streaming data in addition to batch loads. Although data warehouses can accept many different forms of data, they usually ingest structured data using batch loads.
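
As a rough illustration of how streaming and batch ingestion relate, the sketch below groups a stream of records into small batches before appending them, unchanged, to a store. The in-memory list standing in for the lake, the sensor_stream generator and the batch size are all assumptions made for the example; real platforms provide their own ingestion services.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterator of records into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(batch, lake):
    """Append one batch to the in-memory 'lake' without transforming it."""
    lake.extend(batch)
    return len(batch)


def sensor_stream(n):
    """Stand-in for a streaming source; yields n simple records."""
    for i in range(n):
        yield {"id": i, "payload": f"reading-{i}"}


lake = []
batch_sizes = [
    ingest(batch, lake)
    for batch in micro_batches(sensor_stream(7), batch_size=3)
]
print(batch_sizes, len(lake))
```

The same ingest function would serve a large batch load; only the size and arrival pattern of the batches differ.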

How to get started

The first step in data lake implementation is to learn more about data lake architectures, platforms, products and workflows through vendor websites and other resources.

As with any product evaluation, your organization will need to perform a thorough analysis of the competing offerings. Here is a starter list of evaluation criteria to help your analysis:

Technology. Although Apache Hadoop and its suite of supporting products have been the perennial favorites for many organizations, there is a growing number of alternatives. Many vendors that use Hadoop for their data lake offerings provide their own customizations and edge products to simplify, streamline and facilitate administration and analysis.

There is a wide range of platforms available, including Amazon Data Lake Solutions, Microsoft Azure Data Lake, Google Data Lakes, Snowflake for Data Lakes and Oracle Data Lake.

Security and access control. Data lakes hold a treasure trove of information about your business. As with all enterprise data stores, you will need to protect data lakes against unauthorized access.

Data ingestion. Does the platform easily and quickly ingest structured, semistructured and unstructured data? Is it capable of efficiently ingesting data streams, micro batch and mega batch data loads?

Metadata management. Big data specialists use metadata to search, identify and better understand the data sets that are in the data lake. How does the platform capture and store metadata?

Data processing, performance and scalability. What tools and processes does the platform offer users to interact with the data? How does it enable data exploration? What background processes does it execute during the course of daily operations? How fast are those processes and will they scale to meet your workload requirements?

Management and monitoring. Does the platform provide a strong UI for system administration and monitoring? What workload management capabilities does it offer?

Data governance. Does the platform offer mechanisms to ensure the data is consistent and reliable? Does it provide the ability to create sandbox environments that allow users to experiment with data without affecting the contents of the data lake?

Data analysis and accessibility. What mechanisms does the platform provide to analyze the data? Does it allow you to easily incorporate machine learning? What data analytics features does it offer to consumers? Can you easily integrate third-party analysis tools?

Costing strategies. How will the vendor charge you?
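
One way to turn the criteria above into a side-by-side comparison is a weighted scoring matrix. The sketch below is only an illustration: the weights, the 1-to-5 scores and the platform names "Platform A" and "Platform B" are placeholders, not assessments of any real product, and your organization would set its own values.

```python
# Hypothetical weights over the criteria listed above; they sum to 1.0.
WEIGHTS = {
    "technology": 0.10,
    "security": 0.15,
    "ingestion": 0.15,
    "metadata": 0.10,
    "performance": 0.15,
    "management": 0.10,
    "governance": 0.10,
    "analysis": 0.10,
    "cost": 0.05,
}

# Placeholder scores on a 1 (poor) to 5 (excellent) scale.
PLATFORMS = {
    "Platform A": {criterion: 3 for criterion in WEIGHTS},
    "Platform B": {**{criterion: 4 for criterion in WEIGHTS}, "cost": 1},
}


def weighted_score(scores, weights):
    """Return the weighted sum of scores; every criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


def rank(platforms, weights):
    """Return platform names ordered from highest to lowest weighted score."""
    return sorted(
        platforms,
        key=lambda name: weighted_score(platforms[name], weights),
        reverse=True,
    )


for name in rank(PLATFORMS, WEIGHTS):
    print(name, round(weighted_score(PLATFORMS[name], WEIGHTS), 2))
```

Raising an error for a missing score is a deliberate choice here: it prevents a platform from ranking well simply because a criterion was never evaluated.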

Data lake implementation

After platform selection, the next step is to build the organizational infrastructure, processes and procedures to load, govern, administer and analyze data in the data lake.

These are the key steps in a data lake implementation strategy:

Identify the expertise you need to effectively support the platform and analyze the data. Like many complex technologies, data lakes have a steep learning curve. Hire experienced personnel and train internal staff. Your organization will need to define new organizational roles and reporting structures with data lake implementation.
To execute a well-thought-out data lake implementation strategy and design, your organization will need to develop a traditional project plan with goals, milestones and assigned action items. You will need to identify the criteria your organization will use to evaluate the success of the data lake project. Design the system to foster self-service data analysis. You should also develop data classification standards for data storage and archival.
Virtually any data the organization generates is a potential source for data lake ingestion. The challenge becomes one of prioritization. A good approach is to evaluate the source that generates the data and identify its importance to the organization at a high level.
You should determine if the information is currently being analyzed and the level of analysis that is occurring. Highly analyzed data, although still a potential source for ingestion, may have a lower importance than data from a system that is not being evaluated.
Develop, implement and enforce data governance strategies to ensure the data is secure, complete, consistent and accurate.
Establish standards for data exploration, experimentation and analysis. Data scientists should follow a standardized but flexible process to evaluate the data and identify the use cases that will generate the most value to the business. Potential targets for the data are other BI platforms and new and existing business applications.
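
The prioritization step described above, which weighs a source's importance against how heavily it is already analyzed, can be sketched as a simple scoring function. The source names, the 1-to-5 and 0-to-5 scales and the analysis_discount factor are hypothetical; they only show one possible way to make the trade-off explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    importance: int      # 1 (low) to 5 (high), judged at a high level
    analysis_level: int  # 0 (never analyzed) to 5 (heavily analyzed)


def priority(source, analysis_discount=0.5):
    """Score a source: importance, reduced for data that is already analyzed."""
    return source.importance - analysis_discount * source.analysis_level


def prioritize(sources):
    """Return sources ordered from highest to lowest ingestion priority."""
    return sorted(sources, key=priority, reverse=True)


SOURCES = [
    Source("erp_orders", importance=5, analysis_level=5),
    Source("web_clickstream", importance=4, analysis_level=1),
    Source("support_tickets", importance=3, analysis_level=0),
]

ordered = [source.name for source in prioritize(SOURCES)]
print(ordered)
```

With these placeholder values, the heavily analyzed ERP data drops below two less important sources that nobody is evaluating yet, which mirrors the reasoning in the steps above.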
