To meet the needs of enterprises that deploy Hadoop, and help them with their big data requirements, vendors have developed commercial distributions of Hadoop and related open source technologies. Here are some of the top Hadoop distribution vendors, as of this writing.
Alibaba Cloud E-MapReduce
Alibaba Cloud Elastic MapReduce, also known as E-MapReduce or EMR, is a big data processing Hadoop distribution that runs on the Alibaba Cloud platform and facilitates the processing and analysis of vast amounts of data. Built on Alibaba Cloud Elastic Compute Service instances, EMR is based on Apache Hadoop and Apache Spark.
EMR manages organizations' data in a wide range of scenarios, such as trend analysis, data warehousing and online and offline data processing. EMR enables companies to use the Hadoop and Spark ecosystem components -- such as Apache Hive, Apache Kafka, Flink, Druid and TensorFlow -- to analyze and process data.
It also simplifies the process of importing and exporting data to and from other cloud storage systems and database systems, such as Alibaba Cloud Object Storage Service and Alibaba Cloud Distributed Relational Database Service. Most users on the Gartner peer review website seemed to like the product, partly because the implementation was easy. The product allows users to "ingest, structure and analyze information," according to the Alibaba website, and also offers ways to manage clusters. However, one user on the Gartner review site said the platform is too complicated and doesn't work.
Features of Alibaba Cloud EMR include the following:
- Automated cluster deployment and expansion: This allows users to deploy and expand clusters from a web interface without having to manage the hardware and software, and deploy various types of clusters, such as Hadoop, Kafka, Druid and ZooKeeper. They can also add, configure and maintain components based on company needs, and add any type of nodes to existing clusters.
- Workflow scheduling: This feature offers simple job orchestration and scheduling. It supports graphical job editing and management to enable businesses to execute and orchestrate different types of jobs, and it supports job and dependency scheduling. Organizations can orchestrate and schedule jobs as DAG (directed acyclic graph)-based workflows.
- Multiple components: Alibaba Cloud EMR includes Hadoop, Spark, Hive, Kafka and Storm.
- Complete ecosystem support: The tool supports reading and writing data from Alibaba Cloud message services, including Message Queue and Message Service, and supports SDK integration.
- Data integration: Elastic MapReduce integrates with open source, offline, real-time, and Alibaba Cloud-developed data integration tools.
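The DAG-based scheduling mentioned in the workflow scheduling item can be illustrated with a short Python sketch. This is not Alibaba Cloud's API: the job names and the `schedule` helper are hypothetical, and the ordering relies only on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "build_report": {"clean_logs"},
    "train_model": {"clean_logs"},
    "publish": {"build_report", "train_model"},
}


def schedule(dag):
    """Return batches of jobs; every job in a batch has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to keep output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so dependents unlock
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(schedule(workflow), start=1):
        print(step, batch)
```

Jobs that land in the same batch, such as `build_report` and `train_model` here, have no dependency on each other, so a scheduler would be free to run them at the same time.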
Amazon EMR
Amazon EMR is an Amazon Web Services (AWS) tool for big data processing and analysis. It offers an expandable, low-configuration service as an alternative to running in-house cluster computing.
Amazon EMR clusters start with a foundation of big data frameworks, such as Apache Hadoop or Apache Spark. These frameworks typically couple with open source utilities, such as Apache Hive and Apache Pig.
When used together, these big data frameworks can process, analyze and transform vast quantities of data, and interact with AWS databases and storage resources, such as Amazon DynamoDB and Amazon Simple Storage Service (S3). This integration between AWS tools helps IT teams manage, store and glean insights from pools of big data.
Using Amazon EMR, organizations can instantly provision as much or as little capacity as they want to perform data-intensive tasks for such applications as web indexing, log file analysis, machine learning, data mining, financial analysis, scientific simulation and bioinformatics research.
EMR Notebooks offer a managed environment, based on Jupyter Notebook, that helps analysts, developers and data scientists prepare and visualize data, build applications, collaborate with peers and do interactive analysis using EMR clusters. Amazon EMR also gives organizations the option to add and remove capacity automatically or manually.
Some users on the TrustRadius website contend that while the machine learning capabilities of EMR (using the big data tools of Hadoop/Spark) are good, they're not as easy to use as other machine learning tools.
Microsoft Azure HDInsight
Microsoft Azure HDInsight is a cloud-based managed service that is built on the Hadoop-ecosystem components from the Hortonworks Data Platform (HDP) distribution. Azure HDInsight aims to help users deploy and use Hadoop and other Apache big data analysis and processing products more cost-effectively.
To support the service, Microsoft utilizes its managed Azure cloud infrastructure, enabling users to provision Hadoop clusters without having to purchase, install and configure the necessary hardware and software.
Enterprises can use the most popular open source frameworks, including Hadoop, Spark, Hive, LLAP, Kafka, Storm, MapReduce and more, to enable a number of scenarios, such as extract, transform and load (ETL), data warehousing, machine learning and internet of things.
HDInsight supports the latest open source projects from the Apache Hadoop and Spark ecosystems. The product integrates with a wide range of Azure data stores and services, including SQL Data Warehouse, Azure Cosmos DB, Data Lake Storage, Blob Storage, Event Hubs and Data Factory, for building comprehensive analytics pipelines.
Some users said that while Azure HDInsight is a great product, it requires a fair amount of training. "It usually took so long to teach clients how to use it that it was easier to simply control it for them," one user said on the G2 website.
Cloudera CDH
The Cloudera Hadoop distribution, now known simply as CDH, is the core of Cloudera Enterprise. It includes Apache Hadoop, Apache Spark, Apache Kafka and more than a dozen other leading open source projects, all tightly integrated. Built specifically to meet the demands of enterprises, CDH offers the core elements of Hadoop (scalable storage and distributed computing) along with a web-based user interface and enterprise capabilities. CDH is Apache-licensed open source and offers unified batch processing, interactive SQL and interactive search, as well as role-based access controls.
CDH provides an analytics platform and open source technologies to store, process, discover, model and serve large amounts of data.
Features of the software include the following:
- store and deliver structured and unstructured data;
- bring diverse analytics to shared data, including machine learning, batch and stream processing and analytic SQL;
- enable the same platform across hybrid- and multi-cloud deployment environments; and
- process data in parallel and in place with linear scalability.
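One common way Hadoop-based platforms process data in parallel is the MapReduce model: each chunk of data is mapped to a partial result independently, and the partial results are then merged. The sketch below shows the pattern on a word count in plain Python. It runs in a single process and is only an illustration of the idea, not CDH or Hadoop code; the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Count words in one chunk; chunks are independent, so this step can run in parallel."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Merge two partial counts into one."""
    return left + right


def word_count(chunks):
    """Map every chunk to partial counts, then fold the partial counts together."""
    return reduce(reduce_phase, map(map_phase, chunks), Counter())


if __name__ == "__main__":
    chunks = ["big data", "Big clusters", "data data"]
    print(word_count(chunks))
```

Because the map step never looks outside its own chunk, adding more workers lets more chunks be processed at once, which is the intuition behind the scalability claim.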
The Impala framework in CDH allows users to execute interactive SQL queries directly against data stored in the Hadoop Distributed File System (HDFS), Apache HBase or the Amazon Simple Storage Service. Impala shares several technologies and components with Hive, including its SQL syntax (Hive SQL), its Open Database Connectivity (ODBC) driver and the Impala query UI in Hue, which Hive also uses.
An integrated part of CDH and supported through a Cloudera Enterprise subscription, Impala is an open source, massively parallel processing (MPP) analytic database for Apache Hadoop.
Customers on the G2 website said that CDH is user friendly and a very good tool to store and maintain data in the cloud.
Google Cloud Dataproc
Google Cloud Dataproc is a fully managed cloud service for running Apache Spark and Apache Hadoop clusters. Operations that used to take hours or days now only take seconds or minutes, the company claims.
Cloud Dataproc also integrates with other Google Cloud Platform services, giving enterprises a complete platform for data processing, analytics and machine learning.
Cloud Dataproc features include the following:
- Automated cluster management: This allows managed deployment, logging and monitoring.
- Resizable clusters: Companies choose how to create and scale their clusters, with options like virtual machine types, disk sizes, number of nodes and networking.
- Integration: Cloud Dataproc has built-in integration with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging and Stackdriver Monitoring.
- Versioning: Image versioning allows users to switch between different versions of Apache Hadoop, Apache Spark and other tools.
- High availability: Organizations can run clusters with multiple master nodes and set jobs to restart when they fail.
- Developer tools: Cloud Dataproc offers multiple ways to manage a cluster, including web UI, the Google Cloud SDK, RESTful APIs and SSH (Secure Shell) access.
- Initialization actions: Users can run initialization actions to install or customize the settings and libraries they need for their newly created clusters.
- Automatic or manual configuration: The tool automatically configures hardware and software on clusters for users while also allowing for manual control.
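The high availability item mentions setting jobs to restart when they fail. The general idea, independent of any Google Cloud API, might look like the following Python sketch. The function names and the restart limit are assumptions made for illustration only.

```python
def run_with_restarts(job, max_restarts=3):
    """Run job(); if it raises, restart it up to max_restarts times.

    Returns a (result, attempts) tuple, or re-raises the last error
    once the restart budget is used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            if attempts > max_restarts:
                raise  # budget exhausted: surface the failure to the caller


def make_flaky_job(failures):
    """Build a hypothetical job that fails a fixed number of times, then succeeds."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated failure")
        return "done"

    return job


if __name__ == "__main__":
    print(run_with_restarts(make_flaky_job(2)))
```

Capping the number of restarts matters in practice: a job that fails for a permanent reason, such as bad input, would otherwise loop indefinitely.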
Some customers on the G2 website said that Dataproc is the best option for running Apache Hadoop and Apache Spark in the cloud in a fast, fully managed and cost-effective way.
Editor's note: Using extensive research into the Hadoop market, TechTarget editors focused on the vendors that lead in market share, plus those that offer traditional and advanced functionality. Our research included data from TechTarget surveys, as well as reports from other respected research firms, including Gartner and Forrester.
Hortonworks Data Platform
After Cloudera finalized its merger with Hortonworks in January 2019, the company announced plans to develop a unified offering called Cloudera Data Platform. Even after Cloudera launches that product, though, it will continue to develop the existing Cloudera and Hortonworks platforms and support them at least until January 2022, according to company executives.
The Hortonworks Data Platform (HDP) consists entirely of projects built through the Apache Software Foundation and provides an open source environment for data collection, processing and analysis.
HDP enables users to store, process and analyze massive volumes of data from many sources and formats. At its core, the scalable open enterprise Hadoop distribution includes HDFS, a fault-tolerant storage system for processing large amounts of data in a variety of formats, and Apache Hadoop YARN.
YARN, a core part of the open source Hadoop project, provides centralized resource management and job scheduling for Hadoop data processing workloads across various processing methods, including interactive SQL, real-time streaming, data science and batch processing.
The latest version of HDP, 3.1.0, also includes enterprise-grade capabilities for agile application deployment, new machine learning and deep learning workloads, real-time data warehousing, security and governance. In addition, this new version of HDP enables enterprises to gain value more quickly and efficiently from their data in a hybrid environment, the company says.
The modern hybrid data architecture includes cloud storage support to store data in its native format, including Azure Data Lake Storage, Azure Blob Storage, Amazon S3 and Google Cloud Storage (tech preview). It also covers data in transit and data at rest, both on premises and in the cloud.
Some customers on the Gartner peer review website said that the product has many small bugs that the product team is aware of and can resolve in a timely manner. Others said that HDP clusters are difficult to implement and set up, especially in large companies.
MapR
MapR is an enterprise-grade distribution of Apache Hadoop and other big data technologies. The MapR Data Platform offers speed, scale and reliability to drive analytical and operational workloads in one platform.
The MapR Data Platform supports big data storage and processing through the Apache collection of open source technologies, as well as its own added-value components. These components from MapR Technologies provide several enterprise-grade proprietary tools to better manage and ensure the resiliency and reliability of data in Hadoop clusters, MapR claims.
These platform components include MapR XD Distributed File and Object Store, a file system originally known as MapR-FS that the company uses instead of HDFS; MapR Database, an alternative to Hadoop's companion HBase database; and MapR Control System, the product's user interface.
The MapR Hadoop distribution includes a complete implementation of the Hadoop APIs and supports Hadoop data processing tools for accessing Hadoop data, making the product fully compatible with the Hadoop ecosystem. Organizations can easily move data from MapR to other distributions, and vice versa.
MapR Snapshots offers improved data protection by capturing point-in-time snapshots for both files and tables on demand, as well as at regularly scheduled intervals. In addition, MapR offers out-of-the-box business continuity and disaster recovery services based on simple-to-configure mirroring.
MapR also includes MapR Event Store for Apache Kafka, an event streaming system initially called MapR Streams that's designed to support highly scalable real-time streaming of big data, from producers to consumers on their converged platforms.
The distribution provides a sandbox version, a self-contained virtual machine that includes tutorials and demo applications, to help users get started quickly with Hadoop and Spark.
Some users on the Gartner peer review website said that while the MapR ecosystem is good, the user group is relatively small. In addition, there are not many troubleshooting tips online.
Qubole
Qubole Data Service (QDS) offers a self-managing and self-optimizing implementation of Apache Hadoop.
QDS is a cloud-native platform that offers deep analytics, artificial intelligence (AI) and machine learning for big data. Qubole provides easy-to-use end user tools, such as SQL query tools, notebooks and dashboards that utilize open source engines.
Qubole provides a single, shared infrastructure that lets users conduct ETL, analytics and AI or machine learning workloads across open source engines, such as Apache Spark, TensorFlow, Presto, Airflow, Hadoop, Hive and more.
Qubole enables customers to access, configure and monitor their big data clusters in any cloud, and gives users self-service access to data through whatever interface they choose.
Users can query the data through the web-based console in the programming language of their choice, build integrated products using the REST API, use the SDK to build applications with Qubole, and connect to third-party business tools through Open Database Connectivity or Java Database Connectivity.
Qubole simplifies the operational side of running Spark clusters and jobs, both scheduled and ad hoc, according to users.
"It's a solid choice if you want some of the most popular big data tools and don't want to spend time maintaining them yourself," one user said on the G2 website.