Big data as a service is the delivery of data platforms and tools by a cloud provider to help organizations process, manage and analyze large data sets, with the goal of generating insights that improve business operations and provide a competitive advantage.
Given the immense amounts of structured, unstructured and semistructured data being generated on a regular basis by many companies, big data as a service (BDaaS) is intended to free up organizational resources by taking advantage of the data management systems and IT skills of an outside provider, rather than deploying on-premises systems and hiring in-house staff for those functions. Big data as a service can take the form of dedicated systems and software running in the cloud or a contract for a managed service that's hosted and operated by a cloud vendor.
BDaaS is a form of cloud computing, similar to software as a service, platform as a service and infrastructure as a service. In addition to the data processing frameworks and associated tools at its core, big data as a service relies upon cloud storage to maintain data sets and provide access to them for the user organization.
Benefits of BDaaS
Initially, most big data systems were installed in on-premises data centers, primarily by large enterprises that combined various open source technologies to fit their particular big data applications and use cases. But deployments have shifted more to the cloud because of its potential advantages. In particular, big data as a service offers the following benefits to users:
- Reduced complexity. Because of their customized nature, big data environments are complicated to design, deploy and manage. Using cloud infrastructure and managed services can simplify the process by eliminating much of the hands-on work that organizations need to do.
- Easier scalability. In many environments, data processing workloads aren't consistent. For example, big data analytics applications often run intermittently or just once. BDaaS makes it easy to scale up systems when processing needs increase and to scale them down again after jobs are completed.
- Increased flexibility. In addition to scaling systems up or down as needed, BDaaS users can add or remove platforms, technologies and tools to meet evolving business requirements more easily than is typically possible in on-premises big data architectures.
- Potential cost savings. Using the cloud may reduce IT costs by enabling businesses to avoid the need to buy new hardware and software and to hire workers with big data management skills. But pay-as-you-go cloud services must be monitored to prevent unnecessary processing expenses from driving up their cost.
- Stronger security. Concerns about data security kept many organizations from adopting the cloud at first, particularly in regulated industries. In many cases, though, cloud vendors and service providers are able to invest in better security protections than individual companies can.
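The scalability and cost points in the list above come down to simple arithmetic. The following Python sketch compares a cluster that runs continuously with one that is scaled up only while jobs run, and adds a basic budget check of the kind that pay-as-you-go monitoring relies on. The hourly rate, node count, job schedule and budget are hypothetical values chosen for illustration, not prices from any vendor.

```python
# Illustrative only: the rate, cluster size, job schedule and budget used
# below are hypothetical assumptions, not real prices from any provider.

HOURS_PER_MONTH = 730  # average number of hours in a month


def always_on_cost(nodes: int, rate_per_node_hour: float) -> float:
    """Monthly cost of a cluster that runs around the clock."""
    return nodes * rate_per_node_hour * HOURS_PER_MONTH


def on_demand_cost(nodes: int, rate_per_node_hour: float,
                   jobs_per_month: int, hours_per_job: float) -> float:
    """Monthly cost of a cluster that is billed only while jobs run."""
    return nodes * rate_per_node_hour * jobs_per_month * hours_per_job


def over_budget(cost: float, budget: float) -> bool:
    """Flag spending that exceeds the monthly budget."""
    return cost > budget


# Hypothetical workload: 10 nodes at 0.25 per node-hour, 20 jobs of 3 hours.
fixed = always_on_cost(10, 0.25)
elastic = on_demand_cost(10, 0.25, jobs_per_month=20, hours_per_job=3)

print(f"always-on: {fixed:.2f}")    # always-on: 1825.00
print(f"on-demand: {elastic:.2f}")  # on-demand: 150.00
print("over budget:", over_budget(elastic, budget=500.0))  # over budget: False
```

The gap narrows as jobs run more often: in this simplified model, a cluster that is busy for all 730 hours costs the same either way, which is why intermittent workloads benefit most from scaling down.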
Key elements of BDaaS offerings
The top three cloud platform vendors all offer big data technology bundles and services: Amazon EMR from Amazon Web Services (AWS), Google Cloud Dataproc and Microsoft's Azure HDInsight. Other prominent big data as a service vendors include Cloudera, Databricks, HPE, Oracle and Qubole.
The competing BDaaS platforms provide different combinations of open source big data software. Common core technologies include the Hadoop distributed processing framework, Spark processing engine, Hive data warehouse software and Python, R and Scala programming languages. The following tools often are also included as standard or optional components:
- HBase, Hadoop's companion database;
- Flink, Kafka and other technologies for real-time stream processing;
- Presto, a SQL query engine that rivals Hive;
- the Tez application framework;
- analytical tools such as Jupyter Notebook, Mahout, Pig and Zeppelin; and
- the Oozie workflow scheduler, Sqoop data transfer software, ZooKeeper cluster configuration service and other management tools.
Data typically is stored in the Hadoop Distributed File System (HDFS), which is one of Hadoop's core components, or in cloud object storage services like Amazon Simple Storage Service, Google Cloud Storage and Azure Blob Storage. BDaaS platforms can also connect to data warehouse and data lake environments, such as Azure Data Lake Storage, Delta Lake, Iceberg and Snowflake.
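In practice, jobs usually address these storage options through URI schemes, so a path's prefix indicates where the data lives. The sketch below maps schemes commonly used by Hadoop-compatible connectors to the storage systems named above. The mapping is a simplified assumption, and the host, bucket and container names are hypothetical; real platforms may configure other schemes.

```python
from urllib.parse import urlparse

# Simplified, assumed mapping of common connector URI schemes to storage
# systems. Real deployments may use additional or different schemes.
SCHEME_TO_STORAGE = {
    "hdfs": "Hadoop Distributed File System",
    "s3": "Amazon Simple Storage Service",
    "s3a": "Amazon Simple Storage Service",
    "gs": "Google Cloud Storage",
    "wasb": "Azure Blob Storage",
    "wasbs": "Azure Blob Storage",
    "abfs": "Azure Data Lake Storage",
    "abfss": "Azure Data Lake Storage",
}


def storage_backend(uri: str) -> str:
    """Return the storage system implied by a URI's scheme, or 'unknown'."""
    scheme = urlparse(uri).scheme.lower()
    return SCHEME_TO_STORAGE.get(scheme, "unknown")


# Hypothetical paths, used only to exercise the lookup.
example_paths = [
    "hdfs://namenode:8020/data/events",
    "s3a://example-bucket/data/events",
    "gs://example-bucket/data/events",
    "wasbs://container@account.blob.core.windows.net/data/events",
    "/local/data/events",
]

for path in example_paths:
    print(f"{path} -> {storage_backend(path)}")
```

Because only the path prefix changes, the same job can often be pointed at HDFS or at cloud object storage without rewriting its processing logic, provided the matching connector is installed.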
BDaaS market trends
While the big data as a service market is primarily focused on public cloud deployments, users can now install the AWS, Google and Microsoft platforms in their own data centers and other on-premises facilities. That's enabled by added support for running the big data services on each vendor's hybrid cloud platform -- AWS Outposts, Google Anthos and Azure Stack, respectively. Using those technologies, organizations can set up private clouds or mix public cloud and on-premises systems in their big data environments.
All three vendors have also tied their BDaaS platforms to Kubernetes services, which lets organizations use the popular container orchestration framework to build containerized big data applications. Doing so can help simplify deployments, streamline infrastructure management and optimize the use of system resources.
Also, AWS, Google and other BDaaS vendors now emphasize Spark and other technologies over Hadoop, which initially was at the center of their offerings and of the big data ecosystem as a whole. That reflects a broader decline in the standing of Hadoop's MapReduce engine vs. Spark for batch processing, although HDFS and Hadoop's YARN cluster resource management software continue to be widely used.