Hadoop distribution provider Hortonworks is expanding technology partnerships with Google, Microsoft and IBM to broaden the options for users looking to deploy Hortonworks cloud systems.
Most notably, Hortonworks now supports the Google Cloud Storage (GCS) service, with the ability to run applications against data stored there. Cloud-based object stores like GCS have gained greater prominence, at times supplanting the Hadoop Distributed File System (HDFS) as a repository for Hadoop-based big data applications in the cloud.
For Google, the expanded deal announced June 18 furthers its efforts to close a gap with cloud platform market leaders Amazon Web Services and Microsoft. For Hortonworks, the move is part of its efforts to enable users to run big data workloads on multiple clouds, according to Ovum analyst Tony Baer.
Baer said that for many organizations -- particularly ones that are a step below the size of the biggest enterprises -- big data analytics will largely be done in the cloud going forward.
"For people just getting started, even with the work done by the distribution providers, Hadoop is a complicated platform with a lot of moving parts," Baer said. "There's a lot of knowledge needed just to set it up, and that is not a skill most organizations have."
When moving big data workloads to the cloud, users often see a money-saving opportunity in cloud storage tools like GCS, the Amazon Simple Storage Service (S3) and Microsoft's Azure Blob Storage. Such technologies may perform more slowly than HDFS, but Baer said that gap could close with improvements over time. Current GCS users include Spotify, Coca-Cola and the Broad Institute.
Cold data play
Hortonworks CTO Scott Gnau said interest in cloud object stores doesn't prefigure a complete move away from HDFS for Hortonworks cloud users.
"What we see is customers looking to take advantage of different options," Gnau said. Running applications against data stored natively in GCS or S3 lets users "play the data where it lies without having to move it" to HDFS first, he noted. Object stores are also typically less expensive than keeping data in HDFS, according to Gnau.
However, users are likely to continue using HDFS for Hortonworks cloud applications that require high performance and sophisticated data analysis, Gnau added. Object storage "has advantages, but it also has difficulties," he said. "It's not as performant as HDFS."
As a result, Gnau said he sees the best immediate role for cloud-based object storage in handling "colder data" -- that is, data that isn't an immediate part of an analytics workflow.
Sudhir Hasbe, director of product management for the Google Cloud Platform, said Hortonworks users can now decouple storage and compute by using GCS instead of HDFS. That could make it more cost-effective for on-premises HDFS users to use Hortonworks cloud systems for their big data workloads, he continued.
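To make the decoupling idea concrete: in Hadoop-based systems, the storage layer a job reads from is generally selected by the URI scheme of its input path, so the same application can point at HDFS or at an object store. The Python sketch below is only an illustration of that idea, not Hortonworks or Google code. The `describe_location` helper and the bucket and host names are hypothetical, and the scheme table reflects the commonly used Hadoop connector prefixes (`hdfs`, `gs`, `s3a`, `wasb`) rather than anything stated in the announcement.

```python
from urllib.parse import urlparse

# Commonly used Hadoop filesystem URI schemes and the storage behind them.
# This mapping is an illustrative assumption, not a complete list.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "gs": "Google Cloud Storage",
    "s3a": "Amazon S3",
    "wasb": "Azure Blob Storage",
}

# Schemes that refer to cloud object stores, where data lives outside
# the compute cluster.
OBJECT_STORE_SCHEMES = {"gs", "s3a", "wasb"}


def describe_location(uri: str) -> dict:
    """Hypothetical helper: report which storage layer a data URI points at."""
    parsed = urlparse(uri)
    if parsed.scheme not in STORAGE_BY_SCHEME:
        raise ValueError(f"unrecognized storage scheme: {parsed.scheme!r}")
    return {
        "backend": STORAGE_BY_SCHEME[parsed.scheme],
        "container": parsed.netloc,  # bucket name, or NameNode host for HDFS
        "path": parsed.path,
        "decoupled_from_compute": parsed.scheme in OBJECT_STORE_SCHEMES,
    }


if __name__ == "__main__":
    # Hypothetical locations for the same data set in two storage layers.
    for uri in ("hdfs://namenode:8020/events/2018/06",
                "gs://example-bucket/events/2018/06"):
        print(describe_location(uri))
```

In this sketch, only the input path changes between the two runs; the application logic stays the same, which is the sense in which storage and compute are separated.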
IBM, Microsoft clouds also in sight
The Google deal complements other Hortonworks cloud pacts with AWS, IBM and Microsoft. Coming on the first day of Hortonworks' DataWorks Summit 2018 conference in San Jose, Calif., the addition of the GCS support was accompanied by updates to the alliances that the big data platform vendor has with IBM and Microsoft.
Hortonworks said organizations can now run its Hortonworks Data Platform (HDP) software natively on the Microsoft Azure cloud, in addition to using the HDP-based Azure HDInsight managed service that Microsoft sells to customers. Hortonworks DataFlow and Hortonworks DataPlane Service, two related technologies offered by the Santa Clara, Calif., company, also are now available for native deployments on Azure.
Meanwhile, in a blog post, Rob Thomas, general manager of IBM Analytics, said IBM is adding a managed service on its cloud platform called IBM Hosted Analytics with Hortonworks, or IHAH. The new service combines HDP with IBM's Db2 Big SQL query engine and Data Science Experience workbench platform, extending a relationship that began last year when IBM dropped its own Hadoop distribution and agreed to resell HDP instead.
In addition to the expanded cloud deals, Hortonworks detailed plans for an HDP 3.0 release that will let users put big data applications in Docker containers to help speed up deployments and make it easier to move processing workloads to different servers. Due out in the third quarter, HDP 3.0 also adds the ability to run deep learning applications on GPU-based systems, plus support for Apache Hive 3.0, an update of the open source SQL query engine and data warehouse environment that was released in May.
Hive 3.0 functions as a real-time database for analytics applications that require fast query response rates, Gnau said. "It really is a database now versus Hive historically being viewed as a SQL programming environment that ran on Hadoop."
Senior executive editor Craig Stedman contributed to this story.