Buyer's Handbook: Investigate Hadoop distributions for your organization Article 2 of 4


Explore Hadoop distributions to manage big data

Discover the uses of Hadoop distributions and the first steps in evaluating these products, as well as how the merger of rivals Cloudera and Hortonworks affects the market.

By David Loshin, Knowledge Integrity Inc.

Published: 25 Feb 2019

Hadoop is an open source technology that is the data management platform most commonly associated with big data applications today. Its creators designed the original distributed processing framework in 2006 and based it partly on ideas that Google outlined in a pair of technical papers.

Yahoo became the first production user of Hadoop that year. Soon, other internet companies, such as Facebook, LinkedIn and Twitter, adopted the technology and began contributing to its development. Hadoop eventually evolved into a complex ecosystem of infrastructure components and related tools that several vendors package together in commercial Hadoop distributions.

Running on clusters of commodity servers, Hadoop offers users a high-performance, low-cost approach to establishing a big data management architecture to support advanced analytics initiatives.

As awareness of Hadoop's capabilities has increased, its use has spread to other industries for both reporting and analytical applications involving a mix of traditional structured data and newer forms of unstructured and semi-structured data. This includes web clickstream data, online ad information, social media data, healthcare claims records, and sensor data from manufacturing equipment and other internet of things devices.

What is Hadoop?

The Hadoop framework encompasses a large number of open source software components: a set of core modules that capture, process, manage and analyze massive volumes of data, surrounded by a variety of supporting technologies. The core components include:

The Hadoop Distributed File System (HDFS): Supports a conventional hierarchical directory and file system that distributes files across the storage nodes -- i.e., DataNodes -- in a Hadoop cluster.
YARN (short for the good-humored Yet Another Resource Negotiator): Manages job scheduling and allocates cluster resources to running applications, arbitrating among them when there's contention for the available resources. It also tracks and monitors the progress of processing jobs.
MapReduce: A programming model and execution framework for parallel processing of batch applications.
Hadoop Common: A set of libraries and utilities that the other components utilize.
Hadoop Ozone and Hadoop Submarine: Newer technologies that offer users an object store and a machine learning engine, respectively.
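
To make the MapReduce programming model in the list above more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop's APIs; the function names are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big files"]
    print(word_count(sample))
```

Because each map call looks at only one line and each reduce call looks at only one key, the work can be split across machines without the steps needing to coordinate with one another, which is what makes the model suited to batch processing on a cluster.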

In Hadoop clusters, those core pieces and other software modules layer on top of a collection of computing and data storage hardware nodes. The nodes connect via a high-speed internal network to form a high-performance parallel and distributed processing system.
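
As a rough illustration of how HDFS spreads a file across DataNodes, the sketch below splits a file into fixed-size blocks and assigns each block's replicas to nodes. The 128 MB block size and replication factor of 3 are HDFS defaults in recent releases, a detail not stated in this article. The round-robin placement, the function name and the node names are simplifying assumptions; actual HDFS placement is rack-aware and more involved.

```python
import math
from typing import Dict, List

BLOCK_SIZE_MB = 128  # default HDFS block size in recent Hadoop releases
REPLICATION = 3      # default HDFS replication factor


def place_blocks(file_size_mb: int, data_nodes: List[str],
                 block_size_mb: int = BLOCK_SIZE_MB,
                 replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Map each block index to the DataNodes that hold its replicas.

    Round-robin placement is a simplification for illustration only;
    it is not the rack-aware policy that HDFS actually applies.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement: Dict[int, List[str]] = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    # Hypothetical four-node cluster storing a 300 MB file.
    for block, nodes in place_blocks(300, ["dn1", "dn2", "dn3", "dn4"]).items():
        print(block, nodes)
```

Storing several copies of every block on different nodes is what lets a cluster of commodity servers tolerate the loss of an individual machine.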

Because Hadoop is a collection of open source technologies, no single vendor controls it; rather, the Apache Software Foundation manages its development. Apache offers Hadoop under a license that grants users a no-charge, royalty-free right to use the software.

Developers and other users can download the software directly from the Apache website and build Hadoop environments on their own. However, Hadoop vendors provide prebuilt, community versions with basic functionality that users can also download at no charge and install on a variety of hardware platforms. The vendors also market commercial -- or enterprise -- Hadoop distributions that bundle the software with different levels of maintenance and support services.

In some cases, vendors also offer performance and functionality enhancements over the base Apache technology -- for example, by providing additional software tools to ease cluster configuration and management or data integration with external platforms. These commercial offerings make Hadoop more attainable for companies of all sizes.

Commercial offerings are especially valuable when the vendor's support services team can jump-start a company's design and development of its Hadoop infrastructure. The team can also help guide the selection of tools and the integration of advanced capabilities needed to deploy high-performance analytical systems that meet emerging business needs.

The components of a typical Hadoop software stack

What do you actually get when you use a commercial version of Hadoop? In addition to the core components, typical Hadoop distributions will include -- but aren't limited to -- the following:

Alternative data processing and application execution managers, such as Spark, Kafka, Flink, Storm or Tez, that can run on top of or alongside YARN to provide cluster management, cached data management and other means of improving processing performance.
Apache HBase: A column-oriented database management system modeled after Google's Bigtable project that runs on top of HDFS.
SQL-on-Hadoop tools, such as Hive, Impala, Presto, Drill and Spark SQL, that provide varying degrees of compliance with the SQL standard for direct querying of data stored in HDFS.
Development tools, such as Pig, that help developers build MapReduce programs.
Configuration and management tools, such as ZooKeeper or Ambari, that are useful for monitoring and administration.
Analytics environments such as Mahout, which supplies analytical models for machine learning, data mining and predictive analytics.

Because the software is open source, companies don't have to purchase a Hadoop distribution as a product, per se. Instead, the vendors sell annual support subscriptions with varying service-level agreements (SLAs). All of the vendors are active participants in the Apache Hadoop community, although each may promote its own add-on components that it contributes to the community as part of its Hadoop distribution.

Who manages the Hadoop big data management environment?

It's important to recognize that getting the desired performance out of a Hadoop system requires a coordinated team of skilled IT professionals who collaborate on architecture planning, design, development, testing, deployment, and ongoing operations and maintenance to ensure peak performance. Those IT teams typically include:

requirements analysts to assess the system performance requirements based on the types of applications that will run in the Hadoop environment;
system architects to evaluate performance requirements and design hardware configurations;
system engineers to install, configure and tune the Hadoop software stack;
application developers to design and implement applications;
data management professionals to prepare and run data integration jobs, create data layouts and perform other management tasks;
system managers to ensure operational management and maintenance;
project managers to oversee the implementation of the various levels of the stack and application development work; and
a program manager who oversees the implementation of the Hadoop environment and the prioritization, development and deployment of applications.

The Hadoop software platform market

The evolution of Hadoop as a viable, large-scale data management ecosystem has also created a new software market that's transforming the business intelligence and analytics industry. This has expanded both the kinds of analytics applications that user organizations can run and the types of data that the companies can collect and analyze as part of those applications.

The market now includes two major independent vendors that specialize in Hadoop: Cloudera Inc., the new company formed by the merger of Cloudera and Hortonworks that was announced in October 2018, and MapR Technologies Inc. Other companies that offer Hadoop distributions or capabilities include cloud platform market leaders AWS, Google and Microsoft, the last of which uses the Hortonworks distribution as part of a managed big data service.


Over the years, the Hadoop market has matured -- and consolidated -- significantly. IBM, Intel and Pivotal Software all dropped out of the market, but the combination of Cloudera and Hortonworks is the biggest change for users to date. The merger of the former rivals gives the new Cloudera a larger share of the market and could enable it to compete more effectively in the cloud.

In fact, Cloudera's new messaging is that it will deliver "the industry's first enterprise data cloud" -- an indication of its desire to compete with the AWS, Microsoft Azure and Google clouds.

Cloudera plans to develop a unified offering called the Cloudera Data Platform, although it hasn't said when it will become available. In the meantime, the company will continue to develop the existing Cloudera and Hortonworks platforms and support them until at least January 2022.

Although the new Cloudera may be more competitive, a potential downside to the merger is that Hadoop users now have fewer options. That makes it even more critical to evaluate the vendors that provide Hadoop distributions and to understand how their product offerings compare on two primary aspects.

First is the technology itself: what's included in the different distributions, what platforms are they supported on, and, most importantly, what specific components do the individual vendors support?

Second is the service and support model: what types of support and SLAs do vendors provide within each subscription level, and how much do different subscriptions cost?

Understanding how these aspects relate to your specific business requirements will highlight the characteristics that are important for a vendor relationship.
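
One simple way to start the technology side of that evaluation is to list the components your applications require and check which ones each distribution's vendor supports. The sketch below does that with set differences. The distribution names and their supported-component lists are hypothetical placeholders, not statements about any real vendor's product.

```python
from typing import Dict, Set


def missing_components(required: Set[str],
                       distributions: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per distribution, the required components it does not support."""
    return {
        name: required - supported
        for name, supported in distributions.items()
    }


if __name__ == "__main__":
    # Components the planned applications depend on (example values).
    required = {"HDFS", "YARN", "Spark", "Hive"}

    # Hypothetical distributions; real support lists come from the vendors.
    distributions = {
        "Distribution A": {"HDFS", "YARN", "Spark", "Hive", "HBase"},
        "Distribution B": {"HDFS", "YARN", "Spark", "Impala"},
    }

    for name, gaps in missing_components(required, distributions).items():
        status = "covers all required components" if not gaps else f"missing {sorted(gaps)}"
        print(f"{name}: {status}")
```

A check like this only narrows the field; support terms, SLAs and subscription costs still need to be compared separately.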

Linda Rosencrance contributed to this report.

