Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop. Users do not have to invest in or install additional infrastructure on premises when using the technology, as HaaS is provided and managed by a third-party vendor.
Hadoop is a software framework used to manage data and storage for big data applications in clustered systems. It gives users the ability to collect, process and analyze data, and HaaS strives to provide the same experience in the cloud. HaaS is useful for medium- and large-scale organizations that do not have the infrastructure or ability to host Hadoop on premises.
The open source Hadoop big data analytics framework allows large, unstructured data sets to be analyzed. Hadoop's storage mechanism, the Hadoop Distributed File System (HDFS), distributes these data sets across multiple nodes so they can be processed in parallel. Hadoop as a service providers integrate proprietary programs with the Hadoop framework to make it easier for organizations to use, and they typically include management and support capabilities. Most HaaS offerings are cloud-based, and pricing is most often on a per-cluster, per-hour basis.
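The idea of spreading a data set across nodes and processing the pieces in parallel can be illustrated with a minimal sketch. This is plain Python, not the Hadoop API: the function names, the round-robin placement and the use of threads to stand in for cluster nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Cut the data set into fixed-size blocks, as a distributed file system would."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def assign_blocks(blocks, node_names):
    """Place blocks on nodes round-robin (a hypothetical placement policy)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        placement[node_names[index % len(node_names)]].append(block)
    return placement


def count_words(blocks):
    """Work done by one node: count the words in the blocks it holds."""
    counts = Counter()
    for block in blocks:
        for line in block:
            counts.update(line.lower().split())
    return counts


def parallel_word_count(records, node_names, block_size=2):
    """Distribute the records, count per node in parallel, then merge the results."""
    placement = assign_blocks(split_into_blocks(records, block_size), node_names)
    # One worker thread stands in for each node in this sketch.
    with ThreadPoolExecutor(max_workers=len(node_names)) as pool:
        partial_counts = list(pool.map(count_words, placement.values()))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Each simulated node only touches the blocks assigned to it, and the per-node results are merged at the end, which is the general shape of the parallel processing described above.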
HaaS providers offer a variety of features and support, including:
- Hadoop framework deployment support.
- Hadoop cluster management.
- Alternative programming languages.
- Data transfer between clusters.
- Customizable and user-friendly dashboards and data manipulation.
- Security features.
Advantages and disadvantages
HaaS comes with a mix of advantages and disadvantages. Advantages of HaaS include:
- Eliminating the need to deploy additional physical hardware infrastructure.
- A wide range of usable data sources, including clickstream data and emails.
- Supported functions including fraud detection, data warehousing and automatic replication of data in case data is lost.
- Speed, because the tools that process data run on the same servers where the data is located, which increases data processing speeds.
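The speed advantage comes from running processing tasks on the servers that already hold the data. A minimal sketch of that scheduling preference follows; the function, the block-location map and the node names are hypothetical and do not reflect Hadoop's actual scheduler.

```python
def schedule_task(block_id, block_locations, free_nodes):
    """Pick a node for the task that reads block_id.

    Returns (node, is_local). A node that already stores the block is
    preferred, so the data does not have to travel over the network.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    local_candidates = [
        node for node in block_locations.get(block_id, []) if node in free_nodes
    ]
    if local_candidates:
        return local_candidates[0], True
    # No free node holds the block: fall back to any free node (remote read).
    return sorted(free_nodes)[0], False
```

For example, if block `b1` is stored on `n2` and `n3` and only `n1` and `n3` are free, the task is placed on `n3` and reads its data locally.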
Disadvantages, however, include:
- The open source Hadoop framework requires a special set of skills that many organizations do not have in-house or cannot afford.
- Skilled engineers who are well versed in Hadoop are hard to find.
- Hadoop security measures are disabled by default.
- Only medium to large organizations can make efficient use of HaaS.
The range of services that HaaS providers offer on their platforms is both a positive and a negative. Providers can offer a wide variety of features, from just the Hadoop software to extras such as virtual machines. This variety can be useful for organizations that want to choose a provider based on precisely what they need and what the provider offers, but it may also be confusing at first for an organization just starting to consider HaaS.
HaaS providers and provided features
Amazon was the first major provider of Hadoop as a service. Offerings currently in the market include:
- Amazon Elastic MapReduce.
- Microsoft HDInsight.
- IBM InfoSphere BigInsights.
- Oracle Big Data Discovery Tool.
- OpenStack Savanna (since renamed Sahara).
- Google Cloud Dataproc.
Features to look for in a HaaS provider include:
- Persistent data storage in HDFS, which avoids issues associated with translating data stored in other formats into HDFS.
- Elasticity to accommodate a wide variety of workloads.
- Ability to recover from processing failures without restarting the entire process (known as non-stop operations).
- A self-configuring environment that allows automatic configuration based on workload.
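Recovering from a processing failure without restarting the entire process can be pictured as retrying only the task that failed. The sketch below is a simplified illustration, not any provider's implementation; `run_job`, its `max_attempts` parameter and the task dictionary are hypothetical names.

```python
def run_job(tasks, max_attempts=3):
    """Run each task, retrying only the ones that fail.

    tasks maps a task name to a zero-argument callable.
    Returns (results, attempts): the output of each task and how many
    attempts it needed. Completed tasks are never rerun.
    """
    results = {}
    attempts = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = task()
                break  # task succeeded, move on to the next one
            except Exception:
                if attempt == max_attempts:
                    raise  # give up only after the final attempt fails
    return results, attempts
```

A task that fails once and then succeeds is simply run a second time, while the results of tasks that already finished are kept.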